[sword-devel] Unicode questions
chrislit at crosswire.org
Thu May 8 03:46:39 MST 2008
DM Smith wrote:
> On May 7, 2008, at 1:42 AM, Ben Morgan wrote:
>> Just a few questions about unicode things.
>> Are VerseKeys and TKs UTF-8?
I'm not sure how to answer the question with respect to VerseKeys.
They're not tied to specific modules, and whether they are UTF-8 or not
depends on the locale file.
TreeKeys and StrKeys have encodings that match the encoding of the
module to which they belong. So OSIS and TEI modules (which are all
encoded in UTF-8) will have UTF-8 keys.
>> There seem to be a few problems with some of the modules (I may be
>> wrong, but they don't appear correct to me with my limited knowledge
>> of Unicode).
Yes. There are problems in some beta modules. The TEI modules from
Perseus have received virtually no processing, but the transcoding of
key values to UTF-8 was probably done by us, so we can probably fix it.
>> In LewisElem, a big proportion of keys don't seem to be valid UTF-8
>> after ABANTIADES, for example.
Yes, that looks like bad UTF-8.
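
One rough way to find the affected keys is to try decoding each raw key
as strict UTF-8 and collect the ones that fail. The sketch below is only
an illustration: how the raw key bytes are read out of a module is not
shown, and the function name and sample keys are made up for the example.

```python
def invalid_utf8_keys(raw_keys):
    """Return the raw keys (bytes) that do not decode as strict UTF-8."""
    bad = []
    for raw in raw_keys:
        try:
            raw.decode("utf-8")  # error handling is "strict" by default
        except UnicodeDecodeError:
            bad.append(raw)
    return bad


# Hypothetical sample data: the second key is Latin-1 bytes, not UTF-8.
sample_keys = [
    b"ABANTIADES",
    "ÆGYPTUS".encode("latin-1"),
    "ΑΓΡΙΟΣ".encode("utf-8"),
]

for raw in invalid_utf8_keys(sample_keys):
    print("bad key:", raw)
```

Keys that pass this check can still be wrong (for example, text that was
converted to UTF-8 twice decodes without error), so this only catches
the outright invalid byte sequences.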
>> In Autenrieth, ΙΕΡΌΣ has a definition starting with
>> ι<*&γτ;ερός, ἷρός:
>> Is this really meant to be like this?
It looks like the transliterator just needs to be adjusted to avoid XML.
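
If the odd sequence in that definition is an XML entity such as `&gt;`
that was transliterated along with the text, one possible adjustment is
to split the input into markup and plain text and transliterate only the
plain text. This is a sketch, not the converter we actually use: the
mapping table is a tiny made-up Latin-to-Greek table and the function
name is hypothetical.

```python
import re

# Hypothetical, deliberately tiny mapping table, for illustration only.
TABLE = str.maketrans(
    {"i": "ι", "e": "ε", "r": "ρ", "o": "ο", "s": "σ", "g": "γ", "t": "τ"}
)

# Matches an XML tag (<...>) or an entity/character reference (&...;).
MARKUP = re.compile(r"<[^>]*>|&#?\w+;")


def transliterate_text_only(s):
    """Transliterate plain text, copying tags and entities through unchanged."""
    out = []
    pos = 0
    for m in MARKUP.finditer(s):
        out.append(s[pos:m.start()].translate(TABLE))  # plain text before markup
        out.append(m.group(0))                          # markup, left as-is
        pos = m.end()
    out.append(s[pos:].translate(TABLE))                # trailing plain text
    return "".join(out)


print("i&gt;eros".translate(TABLE))          # naive: the entity is mangled too
print(transliterate_text_only("i&gt;eros"))  # the entity survives
```

The regular expression is a simplification; it does not handle CDATA
sections, comments, or a `>` inside an attribute value.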
>> In Autenrieth, the following entry is in there twice: ἌΓΡΙΟΣ
>> There also seem to be many other duplicates.
>> I also sometimes get the error message:
>> ERROR: no buffer to decompress!
> Many dictionaries have duplicate keys with different data. The SWORD
> engine can't handle this. So these need to be merged into a single
> entry, or the SWORD engine needs to be modified to handle it.
> I am surprised that you see these; I would have thought that the later
> ones would have replaced the earlier ones in the idx file as the
> module was being written.
When multiple entries with the same key value are added, they are all
present. Depending on how the binary search algorithm works out, you
could get any of the entries (not necessarily consistently the first or
the last entry added).
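
To illustrate why the result is not predictable, here is a textbook
binary search over a sorted list of (key, text) pairs. It is not the
engine's actual lookup code, and the names and sample data are invented
for the example. It returns whichever matching entry the midpoint
happens to land on, so adding an unrelated entry elsewhere in the index
can change which duplicate comes back.

```python
def find_any(entries, key):
    """Plain binary search: index of *some* entry with this key, or -1."""
    lo, hi = 0, len(entries) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        k = entries[mid][0]
        if k == key:
            return mid  # stops at the first match it probes
        if k < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


# Hypothetical index with a duplicated key "B".
small = [("B", "first"), ("B", "second"), ("C", "c")]
larger = [("A", "a")] + small

print(small[find_any(small, "B")][1])    # the later "B" entry
print(larger[find_any(larger, "B")][1])  # the earlier "B" entry
```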
Historically, we've just added a " (2)" after the second instance of a
key, and so forth. We also have the option of concatenating subsequent
identical-key entries to the first, like we do with VerseKey modules.
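
Both options can be sketched in a few lines. This is only an
illustration of the two approaches, not the code in our import tools;
the function names and the newline separator used when concatenating
are assumptions.

```python
def number_duplicates(entries):
    """Append " (2)", " (3)", ... to the second and later instances of a key."""
    seen = {}
    out = []
    for key, text in entries:
        seen[key] = seen.get(key, 0) + 1
        n = seen[key]
        out.append((key if n == 1 else f"{key} ({n})", text))
    return out


def merge_duplicates(entries, sep="\n"):
    """Concatenate later entries with an identical key onto the first one."""
    merged = {}  # dicts keep insertion order, so first-seen order is preserved
    for key, text in entries:
        merged[key] = merged[key] + sep + text if key in merged else text
    return list(merged.items())


# Hypothetical entries with a duplicated key.
entries = [("B", "x"), ("B", "y"), ("C", "z")]
print(number_duplicates(entries))
print(merge_duplicates(entries))
```

Note that the numbering approach can itself produce a collision if a
source already contains a key that literally ends in " (2)", and the
renamed keys need to be sorted again before the index is written.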