[sword-devel] for the love of unicode

Chris Little sword-devel@crosswire.org
Fri, 15 Jun 2001 15:39:04 -0700

A couple comments on our unicode support...

I committed a couple utility functions for translating between
UTF-32/UCS4 and UTF-8 in the files swunicode.h/cpp.  They might be useful
to any of you working on adding unicode support, but they're definitely
needed for per-character encoding filters.
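
For anyone who has not looked at this kind of conversion, here is a
minimal sketch of the UTF-32 to UTF-8 direction.  The function name
appendUtf8 is made up for illustration and is not the actual
swunicode.h interface; it only shows the standard bit layout.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, NOT the real swunicode.h API.
// Appends the UTF-8 encoding of one UTF-32 code point to `out`.
// Returns false for surrogates and for values above U+10FFFF.
bool appendUtf8(std::uint32_t cp, std::string &out) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range
    if (cp > 0x10FFFF) return false;                 // outside Unicode

    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx + three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

A per-character filter would call something like this once for each
codepoint it looks up in its mapping table.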

The per-character encoding filters are not going too well.  I tried an
STL map and after about 15 minutes, gcc crashed.  So I tried a simple
array, even though it would have to be 21000 shorts in size despite
containing only 7000 actual codepoints.  After about 15 minutes, gcc
crashed with that too.  So that leaves me with a giant switch as the
only option I can think of that actually works.  Breaking it up into a
high-byte switch and numerous smaller low-byte switches might ease the
compiler's workload somewhat and simplify the jump tables, but I don't
know anything about compilers so I'm just guessing.
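
To make the high-byte/low-byte idea concrete, here is a rough sketch
with only four sample entries from the hiragana and katakana rows.
The function names are invented for illustration, a real table would
have thousands of cases, and I am not claiming this form is any
easier on the compiler.

```cpp
#include <cstdint>

// Low-byte switch for lead byte 0xA4 (hiragana row); sample entries only.
static std::uint16_t lookupA4(unsigned char lo) {
    switch (lo) {
        case 0xA2: return 0x3042;  // HIRAGANA LETTER A
        case 0xA4: return 0x3044;  // HIRAGANA LETTER I
        default:   return 0;       // unmapped
    }
}

// Low-byte switch for lead byte 0xA5 (katakana row); sample entries only.
static std::uint16_t lookupA5(unsigned char lo) {
    switch (lo) {
        case 0xA2: return 0x30A2;  // KATAKANA LETTER A
        case 0xA4: return 0x30A4;  // KATAKANA LETTER I
        default:   return 0;       // unmapped
    }
}

// Hypothetical entry point: the high-byte switch picks a per-row
// function, which then switches on the low byte.
// Returns 0 when the byte pair has no entry.
std::uint16_t eucJpToUcs2(unsigned char hi, unsigned char lo) {
    switch (hi) {
        case 0xA4: return lookupA4(lo);
        case 0xA5: return lookupA5(lo);
        default:   return 0;
    }
}
```

Each low-byte switch covers at most one row of the character set, so
no single function has to hold the whole mapping.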

The alternative is to dump the filters altogether.  Is it worth it to
have numerous such very large additions to the executables for modules
that will probably be used by very few?  Converting this text in
x-euc-jp encoding to UTF-8 would, I estimate, make it 1.5 times as
large.  (x-euc-jp is 2 bytes per character; UTF-8 for those codepoints is
probably 3 bytes.)  Those who want these translations will have larger
modules to download, but they will be faster to use since there won't be
any encoding conversion.  Everyone will benefit from smaller executable
sizes and none of the front ends that support UTF-8 will need to do
anything additional to support the new texts.
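
The 1.5x figure is just 3 bytes over 2 bytes.  As a back-of-envelope
sketch (the struct and function names are made up, and it assumes
every two-byte character becomes exactly three bytes in UTF-8 while
ASCII stays at one byte, ignoring the three-byte EUC forms and the
few symbols that need only two bytes in UTF-8):

```cpp
#include <cstddef>

// Hypothetical helper for a rough size estimate; not part of SWORD.
struct SizeEstimate {
    std::size_t eucBytes;   // estimated size in x-euc-jp
    std::size_t utf8Bytes;  // estimated size in UTF-8
};

// asciiChars:  characters that are 1 byte in both encodings
// doubleChars: characters that are 2 bytes in x-euc-jp and are
//              ASSUMED to be 3 bytes in UTF-8
SizeEstimate estimateSizes(std::size_t asciiChars, std::size_t doubleChars) {
    SizeEstimate e;
    e.eucBytes  = asciiChars + 2 * doubleChars;
    e.utf8Bytes = asciiChars + 3 * doubleChars;
    return e;
}
```

With no ASCII at all the ratio is exactly 1.5; any ASCII mixed into
the text (markup, verse numbers) pulls the ratio below that.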

We also need to decide what to do about modules currently in the Symbol
font/encoding.  Do we turn them into UTF-8 and double their size?  Or do
we write per-character encoding filters for them, a much simpler task
for encodings limited to 255 codepoints?
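
For a single-byte encoding the whole filter can be a 256-entry table.
Here is a minimal sketch, assuming the modules follow the usual
Symbol font layout where 'a' is alpha, 'b' is beta, 'g' is gamma and
'd' is delta.  The names are invented, only four letters are filled
in, and the defaults for everything else are placeholders.

```cpp
#include <cstdint>
#include <string>

// Hypothetical table for a Symbol-to-Unicode filter; sample entries only.
struct SymbolTable {
    std::uint16_t map[256];
    SymbolTable() {
        // Placeholder defaults: ASCII maps to itself, everything else
        // to U+FFFD. A real table would fill in every position.
        for (int i = 0; i < 256; ++i)
            map[i] = (i < 0x80) ? static_cast<std::uint16_t>(i) : 0xFFFD;
        map['a'] = 0x03B1;  // GREEK SMALL LETTER ALPHA
        map['b'] = 0x03B2;  // GREEK SMALL LETTER BETA
        map['g'] = 0x03B3;  // GREEK SMALL LETTER GAMMA
        map['d'] = 0x03B4;  // GREEK SMALL LETTER DELTA
    }
};

// Encodes one BMP code point (excluding surrogates) as UTF-8.
static void appendUtf8Bmp(std::uint16_t cp, std::string &out) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical filter: one table lookup per input byte.
std::string symbolToUtf8(const std::string &in) {
    static const SymbolTable table;
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i)
        appendUtf8Bmp(table.map[static_cast<unsigned char>(in[i])], out);
    return out;
}
```

Since each Greek letter comes out as two bytes, the output here is
twice the size of the input, which is the same doubling we would get
from converting the modules up front.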