[sword-devel] [sword-support] Locales
dmsmith555 at yahoo.com
Sat Sep 13 10:58:04 MST 2008
Some observations I've made regarding Lucene (that may also apply to
any other search engine):
The index and the search request must be normalized in the same
fashion. There are several aspects to normalization:
The same Lucene analyzer that is used to build the index needs to be
used to prepare the user input for search. The responsibility of a
Lucene analyzer is to do data normalization and tokenization for both
indexing and search. However, Lucene does not normalize the input
While our new modules are NFC, there is nothing to say whether our
older modules are NFC or not. When the index is built it is important
to know what is stored and how it is stored: whether it is UTF-8
or cp1252 and, if UTF-8, whether it is NFC. It is also important to know
whether the diacritics are removed or not. Until we have deterministic
knowledge of this, we cannot normalize the search request to match the
index. And if the two don't match, searches will give wrong results.
The user's search request needs to be normalized in exactly the same
fashion as the index. Generally the user will input decomposed UTF-8,
that is, they will enter a letter and then the diacritics. When there
is more than one diacritic, they can generally be in any order. If the
user is cutting and pasting from Latin-1 and searching UTF-8 (or vice
versa), that's a problem too. The other thing Peter pointed out is that
some user input is language dependent, such as the ae, oe, ue for
umlauted a, o and u.
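To make the umlaut case concrete, here is a minimal sketch in C++. The helper name expandGermanUmlauts is made up for illustration and is not part of SWORD or Lucene. It assumes the input is UTF-8 with precomposed (NFC) umlauts; a letter followed by a combining diaeresis passes through unchanged, which is one more reason the encoding and normalization form have to be known first.

```cpp
#include <cassert>  // for the checks mentioned after the block
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not a SWORD or Lucene API): rewrite precomposed German
// umlauts in a UTF-8 string to the digraphs a user without a German keyboard
// would type. All other bytes are copied through untouched.
std::string expandGermanUmlauts(const std::string &utf8) {
    static const std::vector<std::pair<std::string, std::string>> table = {
        {"\xC3\xA4", "ae"},  // U+00E4 a with diaeresis
        {"\xC3\xB6", "oe"},  // U+00F6 o with diaeresis
        {"\xC3\xBC", "ue"},  // U+00FC u with diaeresis
        {"\xC3\x84", "Ae"},  // U+00C4 A with diaeresis
        {"\xC3\x96", "Oe"},  // U+00D6 O with diaeresis
        {"\xC3\x9C", "Ue"},  // U+00DC U with diaeresis
    };
    std::string out;
    out.reserve(utf8.size());
    std::size_t i = 0;
    while (i < utf8.size()) {
        bool replaced = false;
        for (const auto &entry : table) {
            // compare() clips the length at the end of the string, so a
            // truncated sequence at the very end simply fails to match.
            if (utf8.compare(i, entry.first.size(), entry.first) == 0) {
                out += entry.second;
                i += entry.first.size();
                replaced = true;
                break;
            }
        }
        if (!replaced) out += utf8[i++];
    }
    return out;
}
```

Run over both the text being indexed and the user's search request, this maps "Müller" and "Mueller" to the same term, so for example assert(expandGermanUmlauts("Mueller") == "Mueller") holds. It only makes sense for German text, so it would have to be selected per language.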
Lucene's StandardAnalyzer is appropriate for English, but not for
other languages as it uses English stop words, English rules for
acronyms, etc. And for Thai and other languages that don't use spaces
to separate words a different "break iterator" is needed. Ultimately,
each language needs its own analyzer.
The generally recommended way to index diacritical text:
Normalize to a known encoding (e.g. UTF-8, NFC) and store it in a
field in multiple forms, e.g.:
Alternate language-dependent forms, e.g. stemmed, umlauts expanded,
compound words separated, ....
The trick here is that these all share the same position, i.e. each
alternate form is added with a position increment of 0.
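As a rough illustration of the position increment idea, the sketch below models a token stream in plain C++. Lucene itself is Java, and the names Token, Variant, asciiLower and analyze are all invented for this example; only the convention is borrowed from Lucene: an increment of 1 advances to the next word position, an increment of 0 stacks a term on the position of the previous one. ASCII lowercasing stands in for the real language-dependent forms.

```cpp
#include <algorithm>
#include <cassert>  // for the checks mentioned after the block
#include <cctype>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// One indexed term together with its position increment.
struct Token {
    std::string term;
    int positionIncrement;
};

// A variant maps a word to one alternate form of that word.
using Variant = std::function<std::string(const std::string &)>;

// Illustrative variant only: lowercases ASCII letters and leaves every
// other byte (including multi-byte UTF-8 sequences) unchanged.
std::string asciiLower(const std::string &word) {
    std::string out = word;
    for (char &c : out) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return out;
}

// Split on whitespace. Each word is emitted as-is with increment 1, then
// every distinct alternate form is emitted with increment 0.
std::vector<Token> analyze(const std::string &text,
                           const std::vector<Variant> &variants) {
    std::vector<Token> tokens;
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        std::vector<std::string> seen{word};
        tokens.push_back({word, 1});
        for (const auto &variant : variants) {
            std::string alt = variant(word);
            if (std::find(seen.begin(), seen.end(), alt) == seen.end()) {
                seen.push_back(alt);
                tokens.push_back({alt, 0});
            }
        }
    }
    return tokens;
}
```

For "Co jsem" with asciiLower as the only variant this yields Co (+1), co (+0), jsem (+1). Because "Co" and "co" sit at the same position, a phrase search matches whichever form the user typed, without the phrase offsets of the following words shifting.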
On Sep 13, 2008, at 10:20 AM, Troy A. Griffitts wrote:
> Thanks Peter,
> Yeah, I believe our new modules are normalized with ICU to be standard
> NFC (Normal Form Composed). Here's an interesting comment regarding
> Your suggestion about normalizing the search string and also the
> search text of the module is exactly what we do for Greek. You can
> search with or without diacritics and transcription annotation:
> (),etc. and find results with or without such.
> In SWORD there is a concept of 'Strip Filters' which are used to
> prepare the text body before sending to the indexer. These typically
> remove the markup. Some modules have extra filters added, by placing
> an entry in their .conf file. An example of these is the papyri
> transcription annotation mentioned above. You will see the line:
> added to:
> And the SWORD engine has an overloaded SWModule::StripText() method.
> Called with no parameters, it will return the stripped text of the module.
> If you supply a const char *buffer, the method will run your
> buffer through the same filters as the module uses.
> So typically, before sending a search term supplied by a user to the
> search method, a programmer would call StripText on the search term:
> SWBuf userSearchTerm = searchEditBox.getText();
> userSearchTerm = currentModule.StripText(userSearchTerm);
> ListKey results = currentModule.search(userSearchTerm);
> If I'm not explaining clearly how this applies... if we decided to
> to: czecep.conf
> (provided we had a simple filter which decided how to normalize Czech)
> Everything should be in place to make things work.
> Does this make sense?
> Peter von Kaehne wrote:
>> DM and I thought about this a while back wrt some problems we had
>> with Farsi - essentially there are three scenarios for each
>> diacritic sign - not there, integrated or extra. Modules usually
>> are a mixture of integrated use of diacritics and extra, more or
>> less pure one or the other.
>> Search entries depend heavily on the keyboard available - a German
>> searching on a German keyboard will use umlauts, a German searching
>> on a British keyboard will use ae, ue or oe, someone else searching
>> a German text might well search simply for a, e or u.
>> So the best way forward appeared at the time to normalise both
>> text and search entry and accept the possibility of extraneous
>> results - particularly around latinate scripts.
>> Alternatively - and I think there is a lot of mileage in there - we
>> should/could demand that modules are designed cleanly in terms of
>> diacritics (i.e. only sequential) and rectified wherever there is
>> a problem. Subsequently only the search entries would need to be
>> normalised or even better could be subject to user settings
>> -------- Original Message --------
>>> Date: Sat, 13 Sep 2008 08:43:08 +0100
>>> From: "Troy A. Griffitts" <scribe at crosswire.org>
>>> To: SWORD Support Volunteers <sword-support at crosswire.org>, refdoc at gmx.net
>>> , SWORD Developers' Collaboration Forum <sword-devel at crosswire.org>
>>> Subject: Re: [sword-support] Locales
>>> I would guess if we build lucene indexes for that Bible, the lucene
>>> would search ignoring accents?
>>> Or that module is not UTF-8?
>>> We have filters that we use on ancient Greek texts that allow
>>> searching regardless of diacritics. He could add a set for any language, but
>>> not sure if this is the right location to place responsibility.
>>> if it was an ICU filter that could work for any language-- like if
>>> just a normalization problem. We could use that one filter for all
>>> Bibles like we do the filter for Greek.
>>> Not sure, just thinking out loud.
>>> Peter von Kaehne wrote:
>>>> Thanks. This is a known problem which causes a lot of
>>>> difficulties in all languages which rely on diacritics.
>>>> There is a plan to improve the search facility.
>>>> -------- Original Message --------
>>>>> Date: Fri, 12 Sep 2008 19:57:58 +0200 (CEST)
>>>>> To: sword-bugs at crosswire.org
>>>>> Subject: [sword-support] Locales
>>>>> Peace and love to my brothers and sisters in Jesus Christ, our
>>>>> Jan, His weak servant.
>>>>> I am sorry to inform you about an error in the search engine of
>>>>> Tool. While using Czech the search does not correctly interpret
>>>>> letters with diacritics, e.g.
>>>>> while typing the request:
>>>>> Nesl svůj kříž
>>>>> the result says that there is
>>>>>> 0 result in the text of Czech Ekumenicky Cesky preklad<
>>>>> even though the searched text was copied & pasted directly from it.
>>>>> I hope it needs only a minor repair, while the search
>>>>> results while looking for the phrases w/o Czech specific letters
>>>>> Wish: the search default is "exact match" hence:
>>>>>> Co jsem napsal, napsal< gives result
>>>>>> co jsem napsal, napsal< gives 0 result
>>>>> As people use the search to help their poor memory, I wish to
>>>>> them with less "censorious" matching criteria. These can be
>>>>> useful in
>>>>> "Advanced search".
>>>>> God helps to your "Opus Dei"