[sword-devel] unicode / utf-8

Chris Little sword-devel@crosswire.org
Thu, 24 May 2001 20:38:57 -0700


> > UTF-8 IS NOT UNICODE!  It supports STORING of unicode.
>
> I thought that all Unicode was 32-bit (at least for the latest version),
> and UTF-8 and UTF-16 are two of the defined encoding sequences for Unicode.
> Thus, strictly speaking, only 32-bit chars are Unicode, but UTF-8 and
> UTF-16 can be called Unicode because they're defined by the standard.

This is all kinda nit-picky.  Unicode is just a table that maps numbers to
glyphs.  (Yeah, that's a slight over-simplification.)  UTF-8 is an encoding
that lets your up to 32-bit character be expressed as 1-6 bytes (none of
which will ever be null unless you're actually expressing null).
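
To make the 1-6 byte scheme concrete, here is a rough sketch in Python (my
choice for illustration, not something the library uses).  The function name
utf8_encode is made up.  It follows the original UTF-8 layout, where the
6-byte form carries 31 bits of payload; later revisions of the standard
restrict UTF-8 to 4 bytes and code points up to U+10FFFF, so treat the 5-
and 6-byte branches as historical.

```python
def utf8_encode(cp):
    """Encode one code point with the original 1-6 byte UTF-8 layout."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:
        # Plain ASCII is stored as itself; this is the only way to get 0x00.
        return bytes([cp])
    # (exclusive upper limit, total bytes, marker bits of the lead byte)
    for limit, length, lead in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if cp < limit:
            break
    else:
        raise ValueError("code point does not fit in 31 bits")
    out = []
    # Peel off 6 bits at a time into continuation bytes (10xxxxxx).
    for _ in range(length - 1):
        out.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever is left goes into the lead byte.
    out.append(lead | cp)
    return bytes(reversed(out))


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x10348):
        print(hex(cp), utf8_encode(cp).hex())
```

Every continuation byte is at least 0x80 and every multi-byte lead byte is
at least 0xC0, which is why a null byte can only come from an actual null.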

> > The question really comes when we try to decide the internal memory
> > storage mechanism of these streams...
> > ...
> > How does searching now work in this new world.

As long as we convert the search string to the encoding of the text being
searched, shouldn't we be fine?  At least the three simple searches should
work without any ill effects.
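
A quick sketch of what I mean, in Python just for illustration.  find_all
and its parameters are hypothetical names, not anything in our API.  The
idea is only that the query gets encoded the same way as the module text,
and then a plain byte search does the rest.

```python
def find_all(text_bytes, query, encoding="utf-8"):
    """Return byte offsets of every occurrence of query in text_bytes.

    text_bytes is the stored text in the given encoding; query is the
    user's search string, converted to that same encoding before matching.
    """
    needle = query.encode(encoding)
    if not needle:
        raise ValueError("empty search string")
    hits = []
    start = text_bytes.find(needle)
    while start != -1:
        hits.append(start)
        # Resume one byte later so overlapping matches are also reported.
        start = text_bytes.find(needle, start + 1)
    return hits


if __name__ == "__main__":
    text = "ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος ἦν πρὸς τὸν θεόν".encode("utf-8")
    for offset in find_all(text, "λόγος"):
        print(offset, text[offset:offset + 10].decode("utf-8"))
```

With UTF-8 this is safe because lead bytes and continuation bytes occupy
separate ranges, so a match on a whole encoded string can't begin in the
middle of some other character.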

> > Lots of things to consider over the next few weeks as we try to hash
> > out an initial shot at supporting this new range of modules.
>
> Looks like we might need to bring in a character manipulation library.
> Make sure it's GPL-ed!  :-)

No, no, make sure it's BSD. :)

Actually there shouldn't be anything in this domain that we can't write.
It's just transformation of numbers between different tables.

If Martin talks us into using iso8859 and other 8/16-bit encodings to save
space, there are some very nice conversion tables at
http://www.unicode.org/Public/MAPPINGS/.  And it might be nice to provide
mechanisms for this to aid front-ends that have no hope of Unicode support.
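
For what it's worth, a table-driven converter really is only a few lines.
The sketch below is Python for illustration; load_table, decode and encode
are made-up names, and SAMPLE is a hand-typed excerpt in the layout the
ISO 8859 mapping files use (byte value, Unicode value, "#" comment), so
check it against the real 8859-7 file before trusting it.

```python
# Hand-typed excerpt (an assumption, not copied from the real file):
# column 1 is the 8-bit code, column 2 the Unicode value, then a comment.
SAMPLE = """\
# ISO 8859-7 style excerpt
0x41\t0x0041\t#\tLATIN CAPITAL LETTER A
0xE1\t0x03B1\t#\tGREEK SMALL LETTER ALPHA
0xEB\t0x03BB\t#\tGREEK SMALL LETTER LAMDA
"""


def load_table(lines):
    """Parse mapping-file lines into a {byte value: code point} dict."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # byte listed without a Unicode mapping
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def decode(data, table):
    """Convert 8-bit encoded bytes to a Unicode string via the table."""
    return "".join(chr(table[b]) for b in data)


def encode(text, table):
    """Convert a Unicode string back to the 8-bit encoding."""
    reverse = {cp: b for b, cp in table.items()}
    return bytes(reverse[ord(ch)] for ch in text)


if __name__ == "__main__":
    table = load_table(SAMPLE.splitlines())
    print(decode(b"\xe1\xebA", table))
    print(encode("λα", table).hex())
```

A character missing from the table raises KeyError here; a real front-end
would presumably want a substitute character instead.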

--Chris