[sword-devel] Accented Greek Texts
chrislit at crosswire.org
Tue Sep 18 09:56:31 MST 2007
DM Smith wrote:
> Chris Little wrote:
>> MorphGNT and an updated Tisch, both from morphgnt.org are up in the beta
> Both of these modules use composed UTF-8 characters.
> In April 2005 we had a discussion on whether Greek should be composed or
> decomposed. I don't remember coming to a resolution. Are we going with
I don't know. The source texts came pre-composed, and I thought about
whether I should normalize them differently, but decided to just stick
with the easiest path (the do-nothing path) to completion.
> To summarize, some frontends (including different browsers viewing the
> Bible Tool) handled composed better than decomposed. Others did the
> opposite. Font choice had significant impact on the results.
> It was noted that we could have filters for composition or decomposition
> to transform as the frontend needed.
Yeah, we already have NFC & NFKD filters. Maybe we should add NFD? In
any case, they require ICU.
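
For anyone unsure what the forms do to Greek, here is a minimal sketch. It uses Python's standard unicodedata module purely for illustration; it is not the SWORD filter code and does not use ICU. The helper name normalize is made up for this example.

```python
import unicodedata

# Precomposed Greek small letter alpha with tonos (U+03AC).
composed = "\u03ac"
# The same letter as base alpha (U+03B1) followed by a combining acute (U+0301).
decomposed = "\u03b1\u0301"
# Polytonic alpha with oxia (U+1F71), which is canonically equivalent to U+03AC.
oxia = "\u1f71"

def normalize(text: str, form: str = "NFC") -> str:
    """Hypothetical helper: return text in the requested normalization form."""
    return unicodedata.normalize(form, text)

# NFC composes, NFD decomposes; both directions round-trip.
print(normalize(decomposed, "NFC") == composed)
print(normalize(composed, "NFD") == decomposed)
# The oxia code point folds into the tonos code point under NFC.
print(normalize(oxia, "NFC") == composed)
```

The last case is the one that tends to bite with polytonic texts: two visually identical letters can be stored as different code points until one normalization is applied.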
> If we allow for modules to vary with regard to this, could/should we
> have an entry in the conf indicating the normalization? Perhaps with the
> values from NFC, NFD, NFKD, NFKC, FCD?
If we allow variation, yes. But I would suggest we just pick a
normalization (NFD or NFC) and stick with it for all modules.
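
Whichever way that goes, it is cheap to check what form a given text is actually in. A rough sketch, again in plain Python rather than anything in the SWORD tree; detect_forms is a hypothetical name, unicodedata.is_normalized needs Python 3.8 or later, and FCD is left out because the standard library has no check for it.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def detect_forms(text: str) -> list:
    """Hypothetical helper: list every normalization form the text already satisfies."""
    return [form for form in FORMS if unicodedata.is_normalized(form, text)]

# Precomposed alpha with tonos satisfies the composed forms only.
print(detect_forms("\u03ac"))
# Base alpha plus combining acute satisfies the decomposed forms only.
print(detect_forms("\u03b1\u0301"))
```

A check like this could be run over a module's text to confirm it matches the agreed form, or matches whatever a conf entry declares if variation is allowed.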
> Should osis2mod do normalization to an agreed upon normalization?
That wouldn't be a bad idea, but it would require ICU.
> How should a Greek (or any other accented text) be indexed with Lucene?
> Should we index various representations: Fully (de)composed,
> un-accented, transliterated?
> It seems that the frontend needs to know how the index is represented so
> that it can appropriately normalize user input.
> Right now Lucene indexes what it is handed and the user is responsible
> for matching that.
That I can't answer, but I would probably index whatever we standardize
on plus the unaccented version of the same.
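
To make that concrete, here is a sketch of producing both forms for indexing, and of normalizing user input the same way. It is plain Python, not Lucene or SWORD code, and the names strip_accents and index_forms are invented for this example. It assumes NFC is the standardized form.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Hypothetical helper: drop combining marks, leaving base letters only."""
    # Decompose first so that accents become separate combining characters.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a nonzero class for accents, breathings, and
    # the iota subscript (U+0345), so all of those are removed here.
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)

def index_forms(text: str, form: str = "NFC") -> tuple:
    """Hypothetical helper: return (standardized form, unaccented form)."""
    return unicodedata.normalize(form, text), strip_accents(text)

# "logos" with a precomposed omicron-tonos (U+03CC).
word = "\u03bb\u03cc\u03b3\u03bf\u03c2"
accented, bare = index_forms(word)
print(accented == word)
print(bare == "\u03bb\u03bf\u03b3\u03bf\u03c2")
```

A frontend would run the user's query through the same two functions before searching, so it would not matter whether the input arrived composed, decomposed, or without accents.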