[mobile-devel] Accented Searching

Fri May 28 04:54:38 MST 2010

On Fri, May 28, 2010 at 7:42 AM, Caleb Maclennan <caleb at alerque.com> wrote:
>
> 2010/5/28 Tóth Tamás <tomika_nospam at freemail.hu>:
> > It's clear that preprocessing the string to be found is not enough in this
> > case. As I see it, a custom compare algorithm has to be implemented.
>
> Tom,
>
> I don't think you understand how pre-processing text with filters for
> search applies to this problem. It does have its weaknesses, but the
> example you give is exactly the kind of problem it solves gracefully.
>
> Remember that both the data set and the search term are run through
> the same filters. So whether you type in Jónás or jonas or JöNâŞ,
> the engine is going to filter that and look through the text for
> jonas. At the same time, the text it is searching through has been
> filtered the same way, so ALL instances in the text have been
> normalized to jonas. When results are returned, they can be returned
> from the original text, not the stripped / filtered version, so the
> proper accents can be shown in the front-end.
>
> In other words the engine will find all instances of a word even when
> the input and output sides don't match because both the query text and
> the source text have been normalized to the same middle.
>
> The limitations involve things that change the meaning of words and do
> not normalize easily. For example, in my language, Turkish, there is a
> problem with the letter i and the undotted variant ı. A user searching
> for "kin" might actually want the word "kin" or they might be using a
> keyboard without the ı letter and want to find the word "kın". As you
> might guess these words have entirely different meanings. Basically
> what ends up happening in a strip/filter scenario is that BOTH words get
> returned all the time and it is impossible to specifically search for
> only one variant. In general this is preferred over not returning
> results at all.
>
> Regards,
> Caleb

In addition to Caleb's excellent advice, you might take a look at
src/modules/filters/utf8*.cpp (e.g., utf8arabicpoints.cpp). I don't know
whether you're currently using ICU, but it would probably help in
developing the filters. I see some are using it and some aren't. You
will probably need to write a filter specific to your language, unless
utf8latin1 does the job (I have no idea).
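
To make the "filter both sides" idea concrete, here is a minimal sketch
in Python. It is not the code from the filters under src/modules/filters
and does not use ICU; it only uses Python's standard unicodedata module.
The names fold, search and EXTRA_FOLDS are hypothetical, and mapping the
Turkish dotless ı to i is an assumption made purely for illustration
(ı has no Unicode decomposition, so stripping accents alone would not
touch it).

```python
import unicodedata

# Letters with no Unicode decomposition that we still want to fold.
# Assumption for illustration only: treat Turkish dotless ı as plain i.
EXTRA_FOLDS = str.maketrans({"ı": "i"})

def fold(text):
    """Lowercase, decompose, and drop combining marks: Jónás -> jonas."""
    lowered = text.lower()
    decomposed = unicodedata.normalize("NFD", lowered)
    # Category "Mn" covers the combining accents, cedillas, dots, etc.
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    return stripped.translate(EXTRA_FOLDS)

def search(term, words):
    """Compare folded forms, but return the ORIGINAL words that match."""
    key = fold(term)
    return [w for w in words if fold(w) == key]

# The query and the data go through the same filter, so any spelling
# of the name finds every instance, and the accents survive in the hits.
print(search("JöNâŞ", ["Jónás", "Péter", "jonas"]))

# The limitation Caleb describes: "kin" and "kın" fold to the same key,
# so both come back no matter which one was typed.
print(search("kin", ["kin", "kın", "kan"]))
```

Lowercasing happens before the marks are stripped so that any combining
dot produced by lowercasing a dotted capital İ is removed along with the
other accents.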

Matthew