[sword-devel] unicode / utf-8

Troy A. Griffitts sword-devel@crosswire.org
Thu, 24 May 2001 18:23:25 -0700


Congrats guys on the UTF-8 / UNICODE support!

A few comments from my experiences over the last week.

A UNICODE (wchar_t) string on Windows is an array of 16-bit characters.
A UNICODE (wchar_t) string on UNIX is typically an array of 32-bit
characters.
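
To see the difference on a given platform, something like the following
should do.  This is only a sketch; wideCharBits() is a name I made up for
illustration, not anything in the SWORD API.

```cpp
#include <climits>
#include <cstddef>

// Width of this platform's wide character type, in bits.
// (Hypothetical helper, for illustration only.)
constexpr std::size_t wideCharBits() {
    return sizeof(wchar_t) * CHAR_BIT;
}
```

The value is whatever the compiler and C library chose for wchar_t, so
code that assumes one width will not carry over to the other platform
unchanged.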

UTF-8 IS NOT UNICODE!  It supports STORING of unicode.
UTF-8 is a VARIABLE-length storage encoding for streams of characters
that are at most 32 bits wide.
The beauty of UTF-8 is that it uses only 1 byte for characters < 128,
which are the majority of characters in a Roman script.  Storing modules
in UTF-8 encoding would not noticeably increase the size of most of our
modules.
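
As a rough illustration of the variable-length scheme, here is a minimal
encoder sketch.  appendUtf8() is a hypothetical name, not a SWORD
function, and I am assuming the current definition of UTF-8, which stops
at U+10FFFF (4 bytes) and excludes the surrogate range.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of one code point to `out`.
// Returns the number of bytes written, or 0 if `cp` cannot be encoded
// (a surrogate, or a value above U+10FFFF).
// Hypothetical helper, for illustration only.
std::size_t appendUtf8(std::uint32_t cp, std::string &out) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
        return 1;
    }
    if (cp < 0x800) {                      // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Plain ASCII text costs exactly one byte per character, which is why a
mostly-Roman module should stay about the same size.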

The question really comes when we try to decide the internal memory
storage mechanism of these streams...

Do we use char [], short[], or long[]?

If we use char[], is it a byte stream of UTF-8, or 1-, 2-, or 4-byte
sequences that each represent a single character (definable as a module
parameter)?
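
One possible shape for the "long[]" option is to keep UTF-8 on disk and
decode to one 32-bit value per character in memory.  The sketch below is
only that, a sketch: decodeUtf8() is a hypothetical name, malformed input
becomes U+FFFD, and it does not reject overlong forms or encoded
surrogates as a strict decoder would.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into one 32-bit code point per character.
// Hypothetical helper, for illustration only.
std::vector<std::uint32_t> decodeUtf8(const std::string &in) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : 0xFFFD);
    }
    return out;
}
```

The trade-off is memory: every character takes 4 bytes in this form, in
exchange for constant-time indexing by character.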

How does searching now work in this new world?
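
For exact-match searching at least, a UTF-8 byte stream can be searched
with the ordinary byte-wise routines, because lead bytes and continuation
bytes never share values, so a valid needle cannot match starting in the
middle of a character.  A minimal sketch, assuming both strings are valid
UTF-8 in the same normalization form (findCodePointIndex() is a
hypothetical name):

```cpp
#include <cstddef>
#include <string>

// Return the character (code point) index of the first match of `needle`
// in `haystack`, or std::string::npos if there is no match.
// Hypothetical helper, for illustration only.
std::size_t findCodePointIndex(const std::string &haystack,
                               const std::string &needle) {
    // A plain byte-wise search is enough to locate the match.
    std::size_t bytePos = haystack.find(needle);
    if (bytePos == std::string::npos) return std::string::npos;

    // Convert the byte offset to a character offset by counting every
    // byte before the match that is not a continuation byte (10xxxxxx).
    std::size_t index = 0;
    for (std::size_t i = 0; i < bytePos; ++i) {
        if ((static_cast<unsigned char>(haystack[i]) & 0xC0) != 0x80) ++index;
    }
    return index;
}
```

This says nothing about case-insensitive or accent-insensitive matching,
which would need real Unicode tables rather than byte comparison.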

Lots of things to consider over the next few weeks as we try to hash
out an initial shot at supporting this new range of modules.

                -Troy.