[sword-devel] character encoding conversion

David Burry sword-devel@crosswire.org
Tue, 12 Jun 2001 09:36:06 -0700

Most higher-level languages have some sort of hash or associative array 
built in, and perhaps there are libraries for C that do this even more 
efficiently, since all keys and values are the same length (two bytes) 
going from UCS-2 to SJIS.  I assume a simple calculation and a 14k array 
will work from SJIS to UCS-2...  In addition, aren't there already lots of 
Unicode conversion libraries out there we could link against?  There are 
literally dozens of conversions to/from Unicode; I don't know if we should 
be maintaining all the tables ourselves...


At 10:19 PM 6/11/2001 -0700, Chris Little wrote:
>I realize new versions of BibleCS and BibleTime are just about to go out
>the door, so I'm not suggesting the following feature be added to those
>front ends before they ship.  We're always going to be in development
>and there may always be a few modules that aren't supported by the
>current released version, but that's a good sign that we're pushing
>ourselves along very quickly.
>As I briefly mentioned, I have a Japanese Bible encoded as Shift-JIS.  I
>think it is advantageous to keep this encoding because it is smaller
>than UTF-8 for Japanese.  When the module is read, it can either be left
>in Shift-JIS or converted to UTF-8 for presentation.
>I wrote a couple of new functions and a new class to do this.  The new
>functions are UTF32to8 and UTF8to32, and they just convert between a
>UTF-32 long int and a 6-char UTF-8 array.  The new class is derived from
>SWFilter and is called SJISUTF8.  It converts SJIS to UTF-32 and then to
>UTF-8.  I have come upon a problem, though.
>There is no real correlation between SJIS and Unicode, so the conversion
>between the two requires a huge lookup table.  (There are about 7000
>code points in SJIS.)  I implemented this as a single big switch.  The
>resulting class takes forever to compile and has a huge file size.
>Does anyone have a suggestion for a better way to store this lookup
>table, which does nothing but correlate 7000 shorts with 7000 different
>shorts?
>
>Per-character encoding filters should be a lot of fun.  We can do all
>kinds of stuff like transliteration and reducing module sizes through
>regional encoding systems while maintaining Unicode compliance of the
>end result.