[sword-devel] character encoding conversion

Chris Little sword-devel@crosswire.org
Mon, 11 Jun 2001 22:19:13 -0700


I realize new versions of BibleCS and BibleTime are just about to go out
the door, so I'm not suggesting the following feature be added to those
front ends before they ship.  We're always going to be in development
and there may always be a few modules that aren't supported by the
current released version, but that's a good sign that we're pushing
ourselves along very quickly.

As I briefly mentioned, I have a Japanese Bible encoded as Shift-JIS.  I
think it is advantageous to keep this encoding because it is smaller
than UTF-8 for Japanese: most kana and kanji take two bytes in Shift-JIS
but three in UTF-8.  When the module is read, it can either be left in
Shift-JIS or converted to UTF-8 for presentation.

I wrote a couple of new functions and a new class to do this.  The new
functions are UTF32to8 and UTF8to32, and they just convert between a
UTF-32 code point held in a long int and a UTF-8 sequence held in a
6-char array.  The new class is derived from SWFilter and is called
SJISUTF8.  It converts SJIS to UTF-32 and then to UTF-8.  I have come
upon a problem, though.
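
For readers unfamiliar with the UTF-32 to UTF-8 step, here is a minimal
sketch of what such a pair of functions might look like.  The names
encodeUtf8 and decodeUtf8 and their signatures are hypothetical, not the
actual UTF32to8/UTF8to32 code.  The sketch also assumes the modern
4-byte limit (code points up to U+10FFFF) rather than the older 6-byte
form, and the decoder does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not the SWORD API): encode one code point as UTF-8.
// Writes up to 4 bytes into out; returns the byte count, or 0 if cp is
// a surrogate or lies beyond U+10FFFF.
std::size_t encodeUtf8(std::uint32_t cp, unsigned char out[4]) {
    if (cp < 0x80) {
        out[0] = static_cast<unsigned char>(cp);
        return 1;
    }
    if (cp < 0x800) {
        out[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
        out[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogates are not characters
        out[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
        out[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
        out[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
        out[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

// Hypothetical helper: decode one code point from the start of a buffer.
// Returns the number of bytes consumed, or 0 if the sequence is truncated
// or has a bad lead/continuation byte.  Overlong forms are not rejected.
std::size_t decodeUtf8(const unsigned char* in, std::size_t len,
                       std::uint32_t& cp) {
    if (len == 0) return 0;
    std::size_t n = 0;
    if (in[0] < 0x80) { cp = in[0]; return 1; }
    else if ((in[0] & 0xE0) == 0xC0) { cp = in[0] & 0x1F; n = 2; }
    else if ((in[0] & 0xF0) == 0xE0) { cp = in[0] & 0x0F; n = 3; }
    else if ((in[0] & 0xF8) == 0xF0) { cp = in[0] & 0x07; n = 4; }
    else return 0;
    if (len < n) return 0;
    for (std::size_t i = 1; i < n; ++i) {
        if ((in[i] & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (in[i] & 0x3F);
    }
    return n;
}
```

As a sanity check, U+3042 (hiragana "a") encodes to the three bytes
E3 81 82, and decoding those bytes gives back U+3042.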

There is no algorithmic relationship between SJIS and Unicode code
points, so the conversion between the two requires a huge lookup table.
(There are about 7000 code points in SJIS.)  I implemented this as a
single big switch statement.  The resulting class takes forever to
compile and has a huge file size.
So....

Does anyone have a suggestion for a better way to store this lookup
table, which does nothing but correlate 7000 shorts with 7000 different
shorts?
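
One possible answer, offered only as a sketch: store the pairs in a
static array sorted by SJIS code and look them up with a binary search.
The compiler then emits plain data instead of a 7000-case switch, and
each lookup takes about 13 comparisons.  The names SjisMapEntry,
kSjisToUnicode and sjisToUnicode are hypothetical, only five sample
entries are shown, and returning U+FFFD for unmapped codes is an
assumption about the desired error handling.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical table entry: one SJIS code and its Unicode code point.
struct SjisMapEntry {
    std::uint16_t sjis;
    std::uint16_t unicode;
};

// Sample entries only; the real table would hold all ~7000 pairs.
// The array MUST stay sorted by the sjis field for the search to work.
static const SjisMapEntry kSjisToUnicode[] = {
    {0x8140, 0x3000},  // ideographic space
    {0x82A0, 0x3042},  // hiragana "a"
    {0x889F, 0x4E9C},  // first level-1 kanji
    {0x93FA, 0x65E5},  // "ni" of Nihon
    {0x967B, 0x672C},  // "hon" of Nihon
};
static const std::size_t kSjisToUnicodeCount =
    sizeof(kSjisToUnicode) / sizeof(kSjisToUnicode[0]);

// Binary search for a double-byte SJIS code.  Returns U+FFFD (the
// replacement character) when the code is not in the table.
std::uint16_t sjisToUnicode(std::uint16_t sjis) {
    const SjisMapEntry* end = kSjisToUnicode + kSjisToUnicodeCount;
    const SjisMapEntry* it = std::lower_bound(
        kSjisToUnicode, end, sjis,
        [](const SjisMapEntry& e, std::uint16_t key) { return e.sjis < key; });
    return (it != end && it->sjis == sjis) ? it->unicode : 0xFFFD;
}
```

At four bytes per entry the full table would be roughly 28 KB of
read-only data.  The same data could also live in an external file
loaded at run time, which would keep it out of the compiled class
altogether.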

Per-character encoding filters should be a lot of fun.  We can do all
kinds of stuff like transliteration and reducing module sizes through
regional encoding systems while maintaining Unicode compliance of the
end result.

--Chris