[sword-devel] dictionary ordering revisited

Thu Mar 19 08:54:52 MST 2009

Daniel Owens wrote:
> I'm working on a dictionary with keys based on the lemma in the 
> MorphGNT module (which so far has no dictionary support). I am running 
> into two problems:
>
> 1. Since the keys are polytonic Greek, the byte ordering method of 
> creating the index totally destroys the ordering of the dictionary 
> (same problem with Vietnamese). BPBible does a reasonable job of 
> dictionary lookup (so far the only front-end coming near to supporting 
> dictionary keys in polytonic Greek), but many obvious lookups like 
> anthropos are thrown off because the words starting with alpha are 
> separated in groups spread out over several places in the index. 
> Looking back in the archives, I saw a comment from Troy from October 
> 2007: "Generating a secondary index on a lexdict which preserves some 
> other order and alternate key is great idea and an easy addition to 
> the current code." Has anything been done with this?
>
> 2. The use of upper case for the display of keys in front-ends is 
> totally unnatural. Can I plead that something be done about this? 
> Surely it is an easy fix, or is it more than a display issue. Not 
> fixing it makes SWORD totally unfriendly for Koine Greek students...

I was going to ask the same thing today, as I was looking at the wiki 
for TEI dictionaries.

The problem is a bit deeper than that.

Chris has pointed out that byte ordering and code point ordering of 
UTF-8 are the same.
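
A small sketch of that point, and of why it hurts polytonic Greek. The sample words are my own picks, not keys from any module. Sorting by code point and sorting by UTF-8 bytes give the same result, and in both cases the alpha words with breathing marks land after beta, because their precomposed forms live in the Greek Extended block (U+1F00 and up).

```python
import unicodedata

# Sample polytonic Greek keys (illustrative only), forced to NFC so the
# code points are predictable.
keys = [unicodedata.normalize("NFC", k)
        for k in ["ἄνθρωπος", "βίος", "ἀγάπη", "αἷμα"]]

# Python compares str values code point by code point.
by_code_point = sorted(keys)
# Sorting on the encoded form compares raw UTF-8 bytes instead.
by_utf8_bytes = sorted(keys, key=lambda k: k.encode("utf-8"))

# Both orderings agree.
print(by_code_point == by_utf8_bytes)

# Show the first code point of each key in sorted order: plain alpha and
# beta come first, the accented alphas come after them.
for k in by_code_point:
    print(f"U+{ord(k[0]):04X}", k)
```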

The first problem is that of normalization of the keys in the module. 
This has several aspects.
1) In Unicode, several different code point sequences can result in the 
same glyph. We have chosen to use ICU's NFC normalization. tei2mod 
normalizes its keys; the other LD module creators (e.g. imp2ld) don't.
2) As you noted, UPPER CASE keys are ugly. Some are unreadable (e.g. 
capital Greek letters with multiple accents). Worse than that, some 
languages don't have upper case forms of their lower case letters. (The 
reverse might also be a problem, though I haven't heard of any cases.) 
And others do have both cases, but they are not yet represented in 
Unicode (e.g. Cherokee).
3) Even after normalization, the ordering can look odd to end users. In 
many languages, code point order is not the proper dictionary order. For 
example, German, Spanish and French dictionaries differ in how they 
order accented characters. ICU supplies collation keys on a per-language 
basis for this.
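
To make the three aspects concrete, here is a rough sketch using only Python's standard unicodedata module. It is not ICU and not what tei2mod does internally. The rough_sort_key function is a hypothetical stand-in for a real collation key: it just strips combining marks and folds case, which is far cruder than ICU's per-language collation.

```python
import unicodedata

# 1) Two code point sequences, one glyph: precomposed vs. base + mark.
precomposed = "\u1f00"        # GREEK SMALL LETTER ALPHA WITH PSILI
decomposed = "\u03b1\u0313"   # alpha followed by COMBINING COMMA ABOVE
print(precomposed == decomposed)                                # differ as stored
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # equal after NFC

# 2) Upper-casing keys loses information: final and medial sigma both
#    become capital sigma, and German sharp s becomes two letters.
print("ς".upper() == "σ".upper())
print("ß".upper())

# 3) Hypothetical stand-in for a collation key (NOT ICU collation):
#    order by the unaccented, case-folded letters, then by the key itself.
def rough_sort_key(key):
    marks_split = unicodedata.normalize("NFD", key)
    base = "".join(c for c in marks_split if not unicodedata.combining(c))
    return (base.casefold(), key)

keys = [unicodedata.normalize("NFC", k)
        for k in ["ἄνθρωπος", "βίος", "ἀγάπη", "αἷμα"]]
ordered = sorted(keys, key=rough_sort_key)
print(ordered)  # the alpha words now sit together, ahead of beta
```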

The second problem is that of normalizing the search request. This has 
several aspects.
1) The search request has to be normalized in exactly the same fashion 
as the keys were when the module was created. Using the same technique 
but a different normalizer might result in a different normalization. It 
may be that a minimum version of ICU is necessary. (Hopefully, later 
versions are backward compatible.)
2) User input may ignore accents (e.g. a dictionary lookup from a Greek 
text that lacks accents, or from a Hebrew text that has vowel points 
turned off). Or users may enter a transliteration (e.g. use oikos to 
look up house in a polytonic Greek dictionary).
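
A minimal illustration of the first aspect, again with Python's standard library rather than ICU. The normalize_key helper and the one-entry dictionary are made up for the example. The point is only that the request misses until it goes through the same normalization as the stored keys.

```python
import unicodedata

def normalize_key(text):
    """Hypothetical shared helper, applied to module keys AND requests."""
    return unicodedata.normalize("NFC", text)

# Keys as a module creator might store them (NFC). Sample entry only.
entries = {normalize_key("λόγος"): "word, speech, account"}

# A request that arrives decomposed (omicron + combining acute), e.g.
# from a front end or a text that uses a different normalization form.
request = unicodedata.normalize("NFD", "λόγος")

raw_hit = request in entries                        # looked up as received
normalized_hit = normalize_key(request) in entries  # same form as the keys

print(raw_hit, normalized_hit)
```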

I think a solution can be layered on top of the module as it is today. 
Basically, one or more secondary indexes are used to look up entries in 
the primary index. Maybe one has accents and another doesn't. Lucene can 
easily be used to create a single lookup with multiple fields, where 
each field is a different representation of the key.
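
Here is a toy version of that layering, using plain dictionaries in place of Lucene. Everything in it is an assumption for illustration: the field names (exact, folded, translit), the fold helper, and the hand-supplied transliterations are not part of SWORD or of any existing module.

```python
import unicodedata
from collections import defaultdict

def nfc(text):
    return unicodedata.normalize("NFC", text)

def fold(text):
    """Strip combining marks and fold case (crude accent-insensitive form)."""
    marks_split = unicodedata.normalize("NFD", text)
    base = "".join(c for c in marks_split if not unicodedata.combining(c))
    return nfc(base).casefold()

# Primary keys with hand-supplied transliterations (illustrative data).
ENTRIES = [("οἶκος", "oikos"), ("ἄνθρωπος", "anthropos"), ("λόγος", "logos")]

def build_index(entries):
    """Map each field's representation of a key back to the primary key(s)."""
    index = {name: defaultdict(list) for name in ("exact", "folded", "translit")}
    for key, translit in entries:
        primary = nfc(key)
        index["exact"][primary].append(primary)
        index["folded"][fold(primary)].append(primary)
        index["translit"][translit.casefold()].append(primary)
    return index

def lookup(index, request):
    """Try the most specific field first, then fall back to looser ones."""
    candidates = (
        ("exact", nfc(request)),
        ("folded", fold(request)),
        ("translit", request.casefold()),
    )
    for field, form in candidates:
        hits = index[field].get(form)
        if hits:
            return hits
    return []

index = build_index(ENTRIES)
print(lookup(index, "οικος"))   # accents omitted: found via the folded field
print(lookup(index, "oikos"))   # transliteration: found via the translit field
```

The secondary fields only ever return primary keys, so the existing module index stays untouched and still does the final fetch.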

I would like to see a solution that is part of the module or a part of 
the SWORD engine.

In Him,
    DM