[sword-devel] Unicode questions

DM Smith dmsmith555 at yahoo.com
Wed May 7 03:59:48 MST 2008

On May 7, 2008, at 1:42 AM, Ben Morgan wrote:

> Hi,
> Just a few questions about unicode things.
> Are VerseKeys and TKs UTF-8?
> There seems to be a few problems with some of the modules (I may be  
> wrong, but they don't appear correct to me with my limited knowledge  
> of unicode)
> In LewisElem, a big proportion of keys don't seem to be valid utf-8
> after ABANTIADES, for e.g.
> 'ABCI\xc2\x80\x90DO'

The actual requirement for LD (lexicon/dictionary) modules is that the
keys are strictly ordered by their bytes. For Unicode, this results in
a collation by code points. For that collation to be consistently
meaningful, the UTF-8 needs to be normalized.
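
To illustrate the point about byte order and normalization, here is a
minimal Python sketch. It is not what the engine or tei2mod actually
does; the choice of NFC and the name sort_key are my assumptions for
the example.

```python
import unicodedata

def sort_key(key: str) -> bytes:
    """Normalize to NFC (an assumed form), then encode to UTF-8 bytes."""
    return unicodedata.normalize("NFC", key).encode("utf-8")

composed = "\u00c9COLE"     # E-acute as a single code point
decomposed = "E\u0301COLE"  # plain E followed by a combining acute accent

# Without normalization the two spellings are different byte strings.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")
# After normalization they compare equal.
normalized_equal = sort_key(composed) == sort_key(decomposed)

keys = ["ZEBRA", composed, "ABBA", decomposed]
# Sorting by UTF-8 bytes gives the same order as sorting by code points.
by_bytes = sorted(keys, key=lambda k: k.encode("utf-8"))
by_code_points = sorted(keys)
```

Note that in the unnormalized list the decomposed spelling sorts before
ZEBRA while the composed one sorts after it, which is the kind of
inconsistency normalization removes.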

Earlier this week I discovered that the SWORD engine will ensure that
the keys are appropriately ordered. The new tei2mod will normalize
the keys and data. Since it is new, there may be problems with it.
Please let us know if you find any.

If the module's conf states that the encoding is UTF-8, it is an error  
for the keys and data to be something other than UTF-8. The new  
tei2mod will detect whether an entry is UTF-8 or not. If it is not, it  
will convert it to UTF-8.
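
The detect-then-convert idea can be sketched in Python as follows.
This is only an illustration under my own assumptions: the helper
names are hypothetical, and latin-1 as the fallback encoding is a
guess for the example, not necessarily what tei2mod uses.

```python
def is_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def to_utf8(raw: bytes, fallback: str = "latin-1") -> bytes:
    """Leave valid UTF-8 alone; otherwise reinterpret using the fallback."""
    if is_utf8(raw):
        return raw
    return raw.decode(fallback).encode("utf-8")

# The key quoted above: \xc2\x80 is a valid pair, but the lone \x90 is not.
key_from_post = b"ABCI\xc2\x80\x90DO"
key_is_valid = is_utf8(key_from_post)

# A latin-1 copyright sign becomes the two-byte UTF-8 sequence.
converted = to_utf8(b"Copyright \xa9 2001")
```

Detection of this kind is a heuristic: pure ASCII always passes, and
some non-UTF-8 byte sequences happen to be valid UTF-8 by accident.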

> In esv.conf, it uses copyright symbol, but it isn't encoded in utf-8

This is an error. A conf should be encoded the same as the module. In  
sections such as About that allow RTF, escape codes can also be used  
for Unicode.
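
As a rough sketch of the escape-code alternative, assuming the standard
RTF form is meant (\uN? where N is a signed 16-bit decimal UTF-16 code
unit and ? is the fallback character), something like this would
produce it. The function name rtf_escape is made up for the example,
and it does not escape backslashes or braces already in the text.

```python
def rtf_escape(text: str) -> str:
    """Replace non-ASCII characters with RTF \\uN? escapes."""
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            pieces.append(ch)
            continue
        # RTF expects each UTF-16 code unit as a signed 16-bit decimal.
        data = ch.encode("utf-16-le")
        for i in range(0, len(data), 2):
            unit = int.from_bytes(data[i:i + 2], "little", signed=True)
            pieces.append(f"\\u{unit}?")
    return "".join(pieces)

escaped = rtf_escape("Copyright \u00a9 2001")
```

Here the copyright sign (U+00A9, decimal 169) comes out as \u169?.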

There are many such problems in our existing confs.

> In Autenrieth, ΙΕΡΌΣ has a definition starting with
> ι<*&γτ;ερός, ἷρός:
> Is this really meant to be like this?
> In Autenrieth, the following entry appears twice: ἌΓΡΙΟΣ
> There also seem to be many other duplicates.
> I also sometimes get the error message:
> ERROR: no buffer to decompress!

Many dictionaries have duplicate keys with different data. The SWORD
engine can't handle this, so these need to be merged into a single
entry, or the SWORD engine needs to be modified to handle them.
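
The merging option could look roughly like the Python sketch below, run
over the (key, text) pairs before a module is built. This is my own
illustration, not an existing tool: merge_duplicates and the newline
separator are hypothetical, and a real module would probably want
markup between the merged bodies.

```python
def merge_duplicates(entries, separator="\n"):
    """Combine entries that share a key, keeping each distinct body once."""
    grouped = {}
    for key, text in entries:
        bodies = grouped.setdefault(key, [])
        if text not in bodies:  # drop verbatim repeats of the same body
            bodies.append(text)
    merged = [(key, separator.join(bodies)) for key, bodies in grouped.items()]
    # Keep the keys strictly ordered by their UTF-8 bytes.
    return sorted(merged, key=lambda item: item[0].encode("utf-8"))

sample = [
    ("BETA", "first sense"),
    ("ALPHA", "only sense"),
    ("BETA", "second sense"),
    ("BETA", "first sense"),
]
merged_sample = merge_duplicates(sample)
```

With the sample data, the three BETA entries collapse into one entry
holding both distinct senses, and ALPHA sorts ahead of it.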

I am surprised that you see these. I would have thought that the later
ones would have replaced the earlier ones in the idx file as the
module was being written.
