[sword-devel] Accented Greek Texts

Tue Sep 18 09:56:31 MST 2007

DM Smith wrote:
> Chris Little wrote:
>> MorphGNT and an updated Tisch, both from morphgnt.org are up in the beta 
>> area.
>>   
> Both of these modules use composed UTF-8 characters.
> 
> In April 2005 we had a discussion on whether Greek should be composed or 
> decomposed. I don't remember coming to a resolution. Are we going with 
> composed?

I don't know. The source texts came pre-composed, and I thought about 
whether I should normalize them differently, but decided to just stick 
with the easiest path (the do-nothing path) to completion.

> To summarize, some frontends (including different browsers viewing the 
> Bible Tool) handled composed better than decomposed. Others did the 
> opposite. Font choice had significant impact on the results.
> 
> It was noted that we could have filters for composition or decomposition 
> to transform as the frontend needed.

Yeah, we already have NFC & NFKD filters. Maybe we should add NFD? In 
any case, they require ICU.
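
For readers unfamiliar with the distinction, here is a minimal sketch of what composed (NFC) and decomposed (NFD) forms look like at the code point level. It uses Python's standard unicodedata module rather than the ICU-based SWORD filters mentioned above, and the helper name code_points is made up for this illustration.

```python
import unicodedata

# Greek small alpha with tonos, precomposed: the single code point U+03AC.
composed = "\u03ac"

# NFD splits it into alpha (U+03B1) plus a combining acute accent (U+0301).
decomposed = unicodedata.normalize("NFD", composed)


def code_points(text):
    """Hypothetical helper: list the code points of a string as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(composed))    # ['U+03AC']
print(code_points(decomposed))  # ['U+03B1', 'U+0301']

# NFC recombines the pair, so the round trip is lossless for this character.
assert unicodedata.normalize("NFC", decomposed) == composed
```

The two strings render the same with a capable font but compare unequal byte for byte, which is why a module, an index and user input all need to agree on one form.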

> If we allow for modules to vary with regard to this, could/should we 
> have an entry in the conf indicating the normalization? Perhaps with the 
> values from NFC, NFD, NFKD, NFKC, FCD?

If we allow variation, yes. But I would suggest we just pick a 
normalization (NFD or NFC) and stick with it for all modules.

> Should osis2mod do normalization to an agreed upon normalization?

That wouldn't be a bad idea, but it would require ICU.

> How should a Greek text (or any other accented text) be indexed with Lucene? 
> Should we index various representations: Fully (de)composed, 
> un-accented, transliterated?
> 
> It seems that the frontend needs to know how the index is represented so 
> that it can appropriately normalize user input.
> 
> Right now Lucene indexes what it is handed and the user is responsible 
> for matching that.

That I can't answer, but I would probably index whatever we standardize 
on plus the unaccented version of the same.
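
As a rough sketch of that idea, assuming NFC is the standardized form: the snippet below produces both the normalized text and an unaccented variant that could be fed to an indexer. It is plain Python with the standard unicodedata module, not SWORD or Lucene code, and the function names strip_accents and index_forms are hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks (accents, breathings, iota subscript)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a nonzero class for combining marks.
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)


def index_forms(text, form="NFC"):
    """Return (standardized form, unaccented form) for one piece of text.

    The choice of NFC as the default is an assumption for this sketch.
    """
    return unicodedata.normalize(form, text), strip_accents(text)


# "logos" with an accented omicron (U+03CC), written with escapes.
word = "\u03bb\u03cc\u03b3\u03bf\u03c2"
standard, unaccented = index_forms(word)
assert unaccented == "\u03bb\u03bf\u03b3\u03bf\u03c2"
```

A frontend would apply the same two functions to the user's search string before querying, so that accented and unaccented input both have something to match against.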

--Chris