[sword-devel] unicode / utf-8

Paul Gear sword-devel@crosswire.org
Fri, 25 May 2001 12:41:55 +1000

> Congrats guys on the UTF-8 / UNICODE support!
> A few comments from my experiences over last week.
> UNICODE string on windows is an array of 16-bit characters.

And Java, FWIW.

> UNICODE string on UNIX is an array of 32-bit characters.
> UTF-8 IS NOT UNICODE!  It supports STORING of unicode.

I thought that all Unicode was 32-bit (at least for the latest version), and
that UTF-8 and UTF-16 are two of the encoding forms the standard defines.
Thus, strictly speaking, only 32-bit chars are Unicode, but UTF-8 and UTF-16
can be called Unicode because they're defined by the standard.
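
To make the code point vs. encoding distinction concrete, here is a minimal
sketch in Python (my choice for illustration only, not something taken from
the SWORD sources; the helper name `encodings` is hypothetical). It takes a
single character and shows its code point next to its UTF-8, UTF-16 and
UTF-32 byte sequences, assuming big-endian byte order with no BOM.

```python
def encodings(ch):
    """Return the code point of `ch` and its bytes in three encoding forms.

    `encodings` is a hypothetical helper for illustration only.
    Big-endian variants are used so that no byte-order mark is emitted.
    """
    return {
        "code point": ord(ch),             # the abstract Unicode value
        "utf-8": ch.encode("utf-8"),       # 1 to 4 bytes per code point
        "utf-16": ch.encode("utf-16-be"),  # 2 bytes, or 4 (a surrogate pair)
        "utf-32": ch.encode("utf-32-be"),  # always 4 bytes
    }


if __name__ == "__main__":
    # U+05D0 HEBREW LETTER ALEF fits in one 16-bit unit;
    # U+10330 GOTHIC LETTER AHSA does not and needs a surrogate pair.
    for ch in ("\u05d0", "\U00010330"):
        info = encodings(ch)
        print("U+%04X" % info["code point"],
              info["utf-8"].hex(),
              info["utf-16"].hex(),
              info["utf-32"].hex())
```

The second character is the interesting one: it does not fit in a single
16-bit unit, so an "array of 16-bit characters" needs two elements for it,
while the 32-bit form always needs exactly one.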

> ...
> The question really comes when we try to decide the internal memory
> storage mechanism of these streams...
> ...
> How does searching now work in this new world?
> Lots of things to consider over the next few weeks as we try to hash
> out an initial shot at supporting this new range of modules.

Looks like we might need to bring in a character manipulation library.  Make
sure it's GPL-ed!  :-)