[sword-devel] [sword-support] Locales

Troy A. Griffitts scribe at crosswire.org
Sat Sep 13 07:20:26 MST 2008


Thanks Peter,

Yeah, I believe our new modules are normalized with ICU to be standard 
NFC (Unicode Normalization Form C, the composed form).  Here's an 
interesting comment regarding Arabic:

http://unicode.org/faq/normalization.html#8
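
In case it helps to see it concretely, here's a rough, untested sketch of 
an NFC pass through ICU's C++ API (the newer Normalizer2 interface; older 
ICU releases spell this differently), using one of the Czech words from 
the report below:

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return 1;

    // "kříž" entered as base letters followed by combining marks (decomposed)
    icu::UnicodeString decomposed;
    decomposed.append((UChar32)'k');
    decomposed.append((UChar32)'r').append((UChar32)0x030C);   // r + combining caron
    decomposed.append((UChar32)'i').append((UChar32)0x0301);   // i + combining acute
    decomposed.append((UChar32)'z').append((UChar32)0x030C);   // z + combining caron

    icu::UnicodeString composed = nfc->normalize(decomposed, status);

    std::string utf8;
    composed.toUTF8String(utf8);
    std::cout << utf8 << std::endl;   // prints the precomposed form "kříž"
    return 0;
}

Unless both the module text and the search term end up in the same form 
(composed or decomposed), byte-for-byte matching will fail even when the 
two strings look identical on screen.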

Your suggestion about normalizing the search string and also the indexed 
search text of the module is exactly what we do for Greek.  You can 
search with or without diacritics and transcription annotation 
([ ] ( ), etc.) and find results with or without them.

In SWORD there is a concept of 'Strip Filters', which are used to filter 
the text body before sending it to the indexer.  These typically remove all 
the markup.  Some modules have extra filters added by placing an extra 
entry in their .conf file.  An example of this is the papyri 
transcription annotation mentioned above.  You will see the line:

LocalStripFilter=PapyriPlain

added to:

hesychius.conf
phi_chr.conf
ddp.conf

And the SWORD engine has an overloaded SWModule::StripText() method.

Called with no parameters, it returns the stripped text of the module's
current entry.  If you supply a const char *buffer, the method will run
your buffer through the same filters the module uses.

So typically, before sending a search term supplied by a user to the 
search method, a programmer would call StripText on the search term, e.g.:

SWBuf userSearchTerm = searchEditBox.getText();
userSearchTerm = currentModule.StripText(userSearchTerm);
ListKey results = currentModule.search(userSearchTerm);
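
To make that fully concrete, the whole flow as a standalone program would 
look roughly like this (untested; the module name and the multi-word 
search type are just my assumptions here, and the method spellings follow 
the current headers):

#include <swmgr.h>
#include <swmodule.h>
#include <swbuf.h>
#include <listkey.h>
#include <iostream>

using namespace sword;

int main() {
    SWMgr library;                                   // finds and loads installed modules
    SWModule *module = library.getModule("CzeCEP");  // assumed name for czecep.conf
    if (!module) return 1;

    const char *rawTerm = "Nesl svůj kříž";          // exactly as the user typed it

    // Run the user's term through the module's own strip filters so it is
    // normalized the same way the indexed/stripped text was.
    SWBuf term = module->StripText(rawTerm);

    // -2 selects the engine's multi-word search type.
    ListKey &results = module->search(term.c_str(), -2);

    for (int i = 0; i < results.getCount(); ++i)
        std::cout << results.getElement(i)->getText() << std::endl;

    return 0;
}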

In case I'm not explaining clearly how this applies: if we decided to add:

LocalStripFilter=czNormalize

to: czecep.conf

(provided we had a simple filter which decided how to normalize Czech)

Everything should be in place to make things work.
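
For illustration only, czNormalize might be little more than an NFC pass, 
along the same lines as the ICU sketch above (this class does not exist 
anywhere yet; it would also have to be compiled into the engine and 
registered with SWMgr under that filter name):

#include <swfilter.h>
#include <swbuf.h>
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <string>

using namespace sword;

// Hypothetical strip filter: normalize the buffer to NFC so that combining
// marks and precomposed Czech letters strip down to the same byte sequence.
class CzNormalize : public SWFilter {
public:
    virtual char processText(SWBuf &text, const SWKey *key = 0,
                             const SWModule *module = 0) {
        UErrorCode status = U_ZERO_ERROR;
        const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);
        if (U_FAILURE(status)) return -1;

        icu::UnicodeString source = icu::UnicodeString::fromUTF8(text.c_str());
        icu::UnicodeString normalized = nfc->normalize(source, status);
        if (U_FAILURE(status)) return -1;

        std::string utf8;
        normalized.toUTF8String(utf8);
        text = utf8.c_str();    // write the normalized form back into the buffer
        return 0;
    }
};

With something like that in place, the LocalStripFilter line would apply 
the same normalization to the text at index time and, via StripText, to 
the user's search term.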

Does this make sense?

	-Troy.





Peter von Kaehne wrote:
> DM and I thought about this a while back wrt some problems we had with Farsi - essentially there are three scenarios for each diacritic sign: not there, integrated, or extra. Modules are usually either a mixture of integrated and extra diacritics, or more or less purely one or the other.
> 
> Search entries depend heavily on the keyboard available - a German searching on a German keyboard will use umlauts, a German searching on a British keyboard will use ae, ue or oe, and someone else searching a German text might well search simply for a, u or o.
> 
> So the best way forward appeared at the time to be to normalise both the text and the search entry, and accept the possibility of extraneous results - particularly around latinate scripts.
> 
> Alternatively - and I think there is a lot of mileage in there - we should/could demand that modules are designed cleanly in terms of diacritics (i.e. only sequential) and rectified wherever there is a problem. Subsequently only the search entries would need to be normalised, or, even better, could be subject to user settings.
> 
> Peter
> 
> 
> 
> 
> -------- Original Message --------
>> Date: Sat, 13 Sep 2008 08:43:08 +0100
>> From: "Troy A. Griffitts" <scribe at crosswire.org>
>> To: SWORD Support Volunteers <sword-support at crosswire.org>, refdoc at gmx.net, SWORD Developers' Collaboration Forum <sword-devel at crosswire.org>
>> Subject: Re: [sword-support] Locales
> 
>> I would guess that if we build Lucene indexes for that Bible, Lucene 
>> would search ignoring accents?
>>
>> Or that module is not UTF-8?
>>
>> We have filters that we use on ancient Greek texts that allow searching 
>> regardless of diacritics.  He could add a set for any language, but I'm 
>> not sure if this is the right location to place responsibility.  Maybe 
>> if it was an ICU filter that could work for any language-- like if it's 
>> just a normalization problem.  We could use that one filter for all 
>> Bibles like we do the filter for Greek.
>>
>> Not sure, just thinking out loud.
>>
>> 	-Troy.
>>
>>
>>
>>
>> Peter von Kaehne wrote:
>>> Thanks. This is a known problem which causes a lot of difficulties in
>>> all languages which rely on diacritics.
>>> There is a plan to improve the search facility.
>>>
>>> Peter
>>>
>>> -------- Original Message --------
>>>> Date: Fri, 12 Sep 2008 19:57:58 +0200 (CEST)
>>>> To: sword-bugs at crosswire.org
>>>> Subject: [sword-support] Locales
>>>> Peace and love to my brothers and sisters in Jesus Christ, our Lord, from
>>>> Jan, His weak servant.
>>>>
>>>> I am sorry to inform you about an error in the search engine of The Bible
>>>> Tool. While using Czech, the search does not correctly interpret all the
>>>> letters with diacritics, e.g.
>>>>
>>>> while typing the request: 
>>>>
>>>> Nesl svůj kříž
>>>>
>>>> http://www.crosswire.org/study/wordsearchresults.jsp?searchTerm=Nesl+sv%C5%AFj+k%C5%99%C3%AD%C5%BE
>>>> the result says that there is 
>>>>> 0 result in the text of Czech Ekumenicky Cesky preklad<
>>>> even though the searched text was copied & pasted directly from it.
>>>>
>>>> I hope it needs only a minor repair, as the search gives good
>>>> results when looking for phrases w/o Czech-specific letters.
>>>>
>>>> Wish: the search default is "exact match", hence:
>>>>> Co jsem napsal, napsal< gives a result
>>>> but
>>>>> co jsem napsal, napsal< gives 0 results
>>>> As people use the search to help their poor memory, I wish to really help
>>>> them with less "censorious" matching criteria. These can be useful in the
>>>> "Advanced search".
>>>>
>>>> May God help your "Opus Dei".
>>>>
>>>>
>>>> _______________________________________________
>>>> sword-support mailing list
>>>> sword-support at crosswire.org
> 



