[sword-devel] Unicode questions

Chris Little chrislit at crosswire.org
Thu May 8 03:46:39 MST 2008



DM Smith wrote:
> On May 7, 2008, at 1:42 AM, Ben Morgan wrote:
> 
>> Hi,
>>
>> Just a few questions about unicode things.
>> Are VerseKeys and TKs UTF-8?

I'm not sure how to answer the question with respect to VerseKeys. 
They're not tied to specific modules, and whether they are UTF-8 or not 
depends on the locale file.

TreeKeys and StrKeys have encodings that match the encoding of the 
module to which they belong. So OSIS and TEI modules (which are all 
encoded in UTF-8) will have UTF-8 keys.

>> There seem to be a few problems with some of the modules (I may be  
>> wrong, but they don't appear correct to me with my limited knowledge  
>> of unicode)

Yes. There are problems in some beta modules. The TEI modules from 
Perseus have received virtually no processing, but the transcoding of 
key values to UTF-8 was probably done by us, so we can probably fix it 
easily.

>> In LewisElem, a big proportion of keys don't seem to be valid utf-8
>> after ABANTIADES, for e.g.
>> 'ABCI\xc2\x80\x90DO'

Yes, that looks like bad UTF-8.
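
A minimal sketch of how a key like this could be checked, assuming the 
escapes in the quoted key stand for raw bytes. The helper name 
first_invalid_utf8_offset is hypothetical and is not part of SWORD; it 
only relies on Python's strict UTF-8 decoder.

```python
def first_invalid_utf8_offset(raw: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as err:
        return err.start
    return None


# Key bytes as quoted above (assumed to be the raw index bytes).
key = b"ABCI\xc2\x80\x90DO"
offset = first_invalid_utf8_offset(key)
if offset is None:
    print("key is valid UTF-8")
else:
    print(f"invalid byte 0x{key[offset]:02x} at offset {offset}")
```

Here \xc2\x80 is a well-formed two-byte sequence, but the \x90 that 
follows is a continuation byte with no lead byte, so the decoder 
rejects the key at that point.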

>> In autenreith, ΙΕΡΌΣ has definition starting with
>> ι<*&γτ;ερός, ἷρός:
>> Is this really meant to be like this?

It looks like the transliterator just needs to be adjusted to avoid XML 
entities.
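
One way such an adjustment might look, as a sketch only: split the 
text on XML entities and transliterate just the pieces in between. 
This assumes the stray "&γτ;" above came from "&gt;" being run through 
the transliterator. TOY_TABLE and transliterate_outside_entities are 
hypothetical names, and the tiny mapping is for illustration; it is 
not the real transliteration table.

```python
import re

# Hypothetical, deliberately tiny Latin-to-Greek table (illustration only).
TOY_TABLE = str.maketrans(
    {"g": "γ", "t": "τ", "e": "ε", "r": "ρ", "o": "ο", "s": "ς"}
)

# Named, decimal, and hexadecimal XML entities, e.g. &gt; &#62; &#x3E;
ENTITY = re.compile(r"(&(?:#\d+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)")


def transliterate_outside_entities(text: str) -> str:
    """Apply TOY_TABLE to everything except XML entities."""
    parts = ENTITY.split(text)
    # With one capturing group, re.split puts the entities at odd indexes.
    return "".join(
        part if i % 2 else part.translate(TOY_TABLE)
        for i, part in enumerate(parts)
    )


print("&gt;".translate(TOY_TABLE))               # naive pass mangles the entity
print(transliterate_outside_entities("&gt;eros"))  # entity left untouched
```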

>> in authenriet, the following entry is in there twice ἌΓΡΙΟΣ
>> There also seem to be many other duplicates.
>> I also sometimes get the error message:
>> ERROR: no buffer to decompress!
> 
> Many dictionaries have duplicate keys with different data. The SWORD  
> engine can't handle this. So these need to be merged into a single  
> entry, or the SWORD engine needs to be modified to handle it.
> 
> I am surprised that you see these, I would have thought that the later  
> ones would have replaced the earlier ones in the idx file as the  
> module was being written.

When multiple entries with the same key value are added, they are all 
present. Depending on how the binary search algorithm works out, you 
could get any of the entries (not necessarily consistently the first or 
the last entry added).
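
To illustrate why the result is not predictable, here is a sketch 
using a textbook binary search over a sorted key list. This is not the 
SWORD engine's actual lookup code; the function and the sample keys are 
hypothetical. It only shows that, with duplicate keys, the entry that 
is hit depends on where the duplicates happen to sit relative to the 
probe positions.

```python
def plain_binary_search(keys, target):
    """Return the index of some entry equal to target, or None."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if keys[mid] == target:
            return mid  # stops at the first probe that matches
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


# Hypothetical sorted index with the key "BETA" present three times.
index = [("ALPHA", "a"), ("BETA", "first"), ("BETA", "second"),
         ("BETA", "third"), ("GAMMA", "g")]
keys = [k for k, _ in index]
hit = index[plain_binary_search(keys, "BETA")]

# The same index with one unrelated entry added at the front.
bigger = [("AARON", "x")] + index
bigger_keys = [k for k, _ in bigger]
other_hit = bigger[plain_binary_search(bigger_keys, "BETA")]

print(hit[1], other_hit[1])
```

In the first index the search lands on the middle duplicate; after an 
unrelated entry is added, the same lookup lands on a different one.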

Historically, we've just added a " (2)" after the second instance of a 
key, and so forth. We also have the option of concatenating subsequent 
identical-key entries to the first, like we do with VerseKey modules.
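
Both options can be sketched in a few lines. This is an illustration 
under assumptions, not the module-building tools' actual code: the 
function names, the (key, text) pair format, and the newline separator 
are all hypothetical.

```python
def number_duplicates(entries):
    """Append ' (2)', ' (3)', ... to the second and later uses of a key."""
    seen = {}
    out = []
    for key, text in entries:
        seen[key] = seen.get(key, 0) + 1
        n = seen[key]
        out.append((key if n == 1 else f"{key} ({n})", text))
    return out


def concatenate_duplicates(entries, separator="\n"):
    """Merge the text of later identical-key entries into the first one."""
    merged = {}  # dicts keep insertion order, so first-seen order survives
    for key, text in entries:
        if key in merged:
            merged[key] += separator + text
        else:
            merged[key] = text
    return list(merged.items())


entries = [("ἌΓΡΙΟΣ", "sense one"), ("ΙΕΡΌΣ", "holy"),
           ("ἌΓΡΙΟΣ", "sense two")]
print(number_duplicates(entries))
print(concatenate_duplicates(entries))
```

The numbering approach keeps every entry separately addressable, but a 
generated key could in principle collide with a key that already ends 
in " (2)". Concatenation avoids that at the cost of longer entries.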

--Chris


