[sword-devel] Accented Greek Texts

DM Smith dmsmith555 at yahoo.com
Tue Sep 18 13:09:44 MST 2007


Chris Little wrote:
> DM Smith wrote:
>   
>> Chris Little wrote:
>>     
>>> MorphGNT and an updated Tisch, both from morphgnt.org, are up in the 
>>> beta area.
>>>   
>>>       
>> Both of these modules use composed UTF-8 characters.
>>
>> In April 2005 we had a discussion on whether Greek should be composed or 
>> decomposed. I don't remember coming to a resolution. Are we going with 
>> composed?
>>     
>
> I don't know. The source texts came pre-composed, and I thought about 
> whether I should normalize them differently, but decided to just stick 
> with the easiest path (the do-nothing path) to completion.
>
>   
>> To summarize, some frontends (including different browsers viewing the 
>> Bible Tool) handled composed better than decomposed. Others did the 
>> opposite. Font choice had significant impact on the results.
>>
>> It was noted that we could have filters for composition or decomposition 
>> to transform as the frontend needed.
>>     
>
> Yeah, we already have NFC & NFKD filters. Maybe we should add NFD? In 
> any case, they require ICU.
>   
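(An NFD filter ought to be nearly a copy of the existing NFC one; if I 
read the ICU API right, the only difference is the mode constant handed 
to the normalizer:

    icu::Normalizer::normalize(src, UNORM_NFD, 0, dst, status);

instead of UNORM_NFC.)
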
>> If we allow for modules to vary with regard to this, could/should we 
>> have an entry in the conf indicating the normalization? Perhaps with the 
>> values from NFC, NFD, NFKD, NFKC, FCD?
>>     
>
> If we allow variation, yes. But I would suggest we just pick a 
> normalization (NFD or NFC) and stick with it for all modules.
>   

So would I. I vote for NFC, as it is a bit more compact. And these 
modules would work as-is. :)
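
For example, GREEK SMALL LETTER ALPHA WITH TONOS is one code point under 
NFC but two under NFD:

    NFC: U+03AC           -> 2 bytes in UTF-8
    NFD: U+03B1 + U+0301  -> 4 bytes in UTF-8

Over a fully accented Greek New Testament, that difference adds up.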
>   
>> Should osis2mod normalize the text to an agreed-upon normalization form?
>>     
>
> That wouldn't be a bad idea, but it would require ICU.
>   

I don't think that requiring ICU for osis2mod is onerous. After all, it 
is just a utility, not a frontend.
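
And the whole pass is roughly one ICU call per chunk of text. A minimal 
sketch (untested, not actual osis2mod code; normalizeNFC is just an 
illustrative name), using ICU's C++ Normalizer API:

    #include <string>
    #include <unicode/unistr.h>
    #include <unicode/normlzr.h>

    // Normalize a UTF-8 string to NFC before it is written to the module.
    std::string normalizeNFC(const std::string &utf8) {
        icu::UnicodeString src(utf8.c_str(), "UTF-8");
        icu::UnicodeString dst;
        UErrorCode status = U_ZERO_ERROR;
        icu::Normalizer::normalize(src, UNORM_NFC, 0, dst, status);
        if (U_FAILURE(status))
            return utf8;  // leave the text untouched on error
        std::string out;
        dst.toUTF8String(out);
        return out;
    }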

>   
>> How should a Greek text (or any other accented text) be indexed with 
>> Lucene? Should we index various representations: fully (de)composed, 
>> unaccented, transliterated?
>>
>> It seems that the frontend needs to know how the index is represented so 
>> that it can appropriately normalize user input.
>>
>> Right now Lucene indexes what it is handed and the user is responsible 
>> for matching that.
>>     
>
> That I can't answer, but I would probably index whatever we standardize 
> on plus the unaccented version of the same.

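Producing the unaccented form for the index should be easy with ICU as 
well. The usual trick (a sketch, untested; stripAccents is illustrative, 
not SWORD code) is to decompose, drop the combining marks, and recompose:

    #include <unicode/unistr.h>
    #include <unicode/translit.h>
    #include <unicode/utrans.h>

    // Strip accents from a verse before adding it to the unaccented
    // index field: decompose, remove nonspacing marks, recompose.
    void stripAccents(icu::UnicodeString &text) {
        UErrorCode status = U_ZERO_ERROR;
        icu::Transliterator *unaccent =
            icu::Transliterator::createInstance(
                "NFD; [:Nonspacing Mark:] Remove; NFC",
                UTRANS_FORWARD, status);
        if (U_FAILURE(status)) return;
        unaccent->transliterate(text);  // modifies text in place
        delete unaccent;
    }
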
I'm looking into whether the Lucene indexes for accented characters have 
any other problems. According to Lucene's documentation, the standard 
analyzer lowercases tokens, making searches case-insensitive, but I'm 
not so sure that this holds for accented characters.
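
One quick check on the case question would be to see what ICU's own case 
mapping does with an accented capital (a sketch, untested); if simple 
per-character case mapping handles it, an analyzer that lowercases 
character by character probably does too:

    #include <unicode/uchar.h>
    #include <cstdio>

    int main() {
        // GREEK CAPITAL LETTER ALPHA WITH TONOS; expect U+03AC back.
        UChar32 upper = 0x0386;
        UChar32 lower = u_tolower(upper);
        printf("U+%04X -> U+%04X\n", (unsigned)upper, (unsigned)lower);
        return 0;
    }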


