[sword-devel] RE: UTF-8 and new module classes

Chris Little sword-devel@crosswire.org
Wed, 23 May 2001 23:57:18 -0700


> -----Original Message-----
> From: owner-bt-devel@crosswire.org
> [mailto:owner-bt-devel@crosswire.org]On Behalf Of Joachim Ansorg
>
> Qt supports Unicode and also UTF-8. How do we see if a module uses UTF-8?
> We have to know this because we then have to use different functions
> (QString::fromUtf8(...) instead of QString::fromLocal8Bit(...)).
> The HTML widget should also support UTF-8 if the text was converted using
> QString::fromUtf8().
> If we get information on how to recognize UTF-8 modules before the code
> freeze, we might try to implement it.

I added a line "Encoding=UTF-8" to the .conf files for the ChiGU-UTF8 module
on Crosswire, so you can check for that to determine if it is UTF-8.  We'll
assume ASCII as the default value of Encoding.
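
For illustration, here is a minimal sketch of that check in plain C++.  The
helper names moduleEncoding and isUtf8Module are hypothetical and not part
of the SWORD API, and the sketch assumes the .conf contents have already
been read into a string with one "Key=Value" entry per line.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not a SWORD API call): scans the text of a module
// .conf file for a line starting with "Encoding=" and returns its value.
// Returns "ASCII", the default described above, when no such line exists.
std::string moduleEncoding(const std::string &confText) {
    const std::string key = "Encoding=";
    std::istringstream in(confText);
    std::string line;
    while (std::getline(in, line)) {
        // Drop a trailing carriage return so DOS line endings also work.
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        if (line.compare(0, key.size(), key) == 0)
            return line.substr(key.size());
    }
    return "ASCII";
}

// True when the .conf text declares Encoding=UTF-8.
bool isUtf8Module(const std::string &confText) {
    return moduleEncoding(confText) == "UTF-8";
}
```

A frontend could call isUtf8Module() once per module and then pick
QString::fromUtf8() or QString::fromLocal8Bit() accordingly.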

Eventually, I would like to get any modules with characters that conflict
with UTF-8 (any characters in the range 0x80 to 0xFF) into UTF-8 so we can
do away with the Encoding value also and just accept everything as being
UTF-8.
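
As a rough sketch of what such a conversion involves: the first function
below finds text that contains bytes in the 0x80 to 0xFF range, and the
second re-encodes it.  The second function ASSUMES the existing text is
ISO-8859-1 (Latin-1), which is only a guess for illustration; a module in
another 8-bit code page would need its own mapping table.  Both function
names are hypothetical.

```cpp
#include <string>

// True if any byte is in the range 0x80..0xFF, i.e. the text is not
// plain 7-bit ASCII and would be misread if treated as UTF-8 as-is.
bool hasHighBytes(const std::string &text) {
    for (unsigned char c : text)
        if (c >= 0x80)
            return true;
    return false;
}

// ASSUMPTION: the input is ISO-8859-1, where byte value N is code point
// U+00NN.  Bytes below 0x80 are copied; every other byte becomes a
// two-byte UTF-8 sequence.
std::string latin1ToUtf8(const std::string &text) {
    std::string out;
    for (unsigned char c : text) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));   // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F)); // continuation byte
        }
    }
    return out;
}
```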

I should also retract my previous statement that we can get rid of the Font
value.  It is simply a better idea to have numerous smaller fonts, each
covering the correct range for a module, than to have a single huge font
able to display all Unicode glyphs.

> But I have to test it with a module. Where do I get a suitable
> unicode font?

When Troy and I were working on the UTF-8 to RTF filter, we found the page
http://www.cl.cam.ac.uk/~mgk25/unicode.html very useful.  It has lots of
links and includes a good explanation of the UTF-8 encoding scheme (though
hopefully no one else will need to do it from scratch).
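
For anyone curious what the scheme looks like, here is a simplified sketch
of decoding a single UTF-8 sequence into a code point.  This is not the
code from the UTF-8 to RTF filter; decodeUtf8 is a hypothetical name, and
the sketch does not reject overlong forms or surrogate values, so it is not
a full validator.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes the UTF-8 sequence that starts at s[pos] (pos must be less than
// s.size()) and advances pos past it.  On a malformed or truncated
// sequence, returns U+FFFD and advances pos by one byte.
std::uint32_t decodeUtf8(const std::string &s, std::size_t &pos) {
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t extra; // number of continuation bytes expected
    std::uint32_t cp;  // code point assembled so far

    if (lead < 0x80)                { extra = 0; cp = lead; }        // 0xxxxxxx
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; } // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; } // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; } // 11110xxx
    else { ++pos; return 0xFFFD; }  // stray continuation or invalid byte

    if (pos + extra >= s.size()) { ++pos; return 0xFFFD; } // truncated input

    for (std::size_t i = 1; i <= extra; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80) { ++pos; return 0xFFFD; } // not 10xxxxxx
        cp = (cp << 6) | (c & 0x3F); // append the six payload bits
    }
    pos += extra + 1;
    return cp;
}
```

Calling it in a loop while pos is less than the string length walks
through a whole verse one code point at a time.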

It lists a number of pages with fonts, including
http://www.hclrss.demon.co.uk/unicode/fontsbyrange.html.  The range needed
for reading the ChiGu-UTF8 module is the CJK Unified Ideographs range.
Unfortunately, that page says there is no Unix support.  If you can use Windows
TTFs, then there are fonts that appear to have pretty good support for
glyphs in different ranges: Arial Unicode MS from Microsoft, Bitstream
Cyberbit, and Code2000 (which is $5 shareware).
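
When judging whether a font covers what a module needs, it can help to test
code points against the block in question.  A small sketch, using the
Unicode block boundaries U+4E00 to U+9FFF for CJK Unified Ideographs (the
function name is hypothetical):

```cpp
#include <cstdint>

// True if the code point lies in the CJK Unified Ideographs block
// (U+4E00..U+9FFF).  Extension blocks outside this range are not covered.
bool isCjkUnifiedIdeograph(std::uint32_t codePoint) {
    return codePoint >= 0x4E00 && codePoint <= 0x9FFF;
}
```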

Good luck.  I look forward to seeing it work and to being able to use some
of Unbound Bible's numerous other UTF-8 texts.

--Chris