[bt-devel] RE: UTF-8 and new module classes

Troy A. Griffitts bt-devel@crosswire.org
Thu, 24 May 2001 18:22:14 -0700


Congrats guys on the UTF-8 / UNICODE support!

A few comments from my experiences over the last week.

A UNICODE (wide-character) string on Windows is an array of 16-bit characters.
A UNICODE (wide-character) string on UNIX is typically an array of 32-bit
characters.

UTF-8 IS NOT UNICODE!  It supports STORING of Unicode.
UTF-8 is a VARIABLE-length storage encoding for streams of (at most)
32-bit characters.
The beauty of UTF-8 is that it uses only 1 byte for each character < 128,
which covers the majority of characters in a Roman script.  Storing modules
in UTF-8 encoding would not noticeably increase the size of most of our
modules.
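
To make the variable-length point concrete, here is a small sketch in
Python (chosen only for brevity; the sample characters are arbitrary
picks of mine and do not come from any module). It reports how many bytes
UTF-8 needs for a handful of code points.

```python
# Illustrative sketch: how many bytes UTF-8 uses per character.
# The sample characters below are arbitrary examples, not module data.
samples = {
    "A": "Latin capital A (code point < 128)",
    "\u00e9": "Latin small e with acute",
    "\u03b1": "Greek small alpha",
    "\u05d0": "Hebrew alef",
    "\u4e2d": "CJK ideograph",
    "\U00010330": "Gothic letter ahsa (outside the 16-bit range)",
}


def utf8_length(ch):
    """Return the number of bytes UTF-8 uses to store one character."""
    return len(ch.encode("utf-8"))


for ch, name in samples.items():
    # Print only the code point and name so the output stays plain ASCII.
    print(f"U+{ord(ch):04X} {name}: {utf8_length(ch)} byte(s)")
```

So a mostly-ASCII English text stays at about one byte per character,
while Greek or Hebrew text costs about two bytes per letter.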

The question really comes when we try to decide the internal memory
storage mechanism of these streams...

Do we use char [], short[], or long[]?

If we use char[], is it a byte stream of UTF-8, or a stream of 1-, 2-, or
4-byte sequences that each represent a single character (with the width
definable as a module parameter)?
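
As a rough way to weigh those options, the sketch below (Python again, with
a hypothetical helper name, storage_sizes, and sample strings of my own)
compares how many bytes the same text occupies as UTF-8 bytes, as 16-bit
units, and as 32-bit units. It only measures size and says nothing about
which layout the library should pick.

```python
# Illustrative sketch: memory cost of one text under three layouts.
# storage_sizes is a hypothetical helper, not part of any existing API.
def storage_sizes(text):
    """Return the size in bytes of `text` under three encodings."""
    return {
        "UTF-8 bytes": len(text.encode("utf-8")),
        "16-bit units (UTF-16)": len(text.encode("utf-16-le")),
        "32-bit units (UTF-32)": len(text.encode("utf-32-le")),
    }


english = "In the beginning"
greek = "\u03bb\u03cc\u03b3\u03bf\u03c2"  # five Greek letters

for label, text in (("english", english), ("greek", greek)):
    print(label, storage_sizes(text))
```

Note that 16-bit units are not truly fixed-width either: characters above
the 16-bit range need two units (a surrogate pair) in UTF-16, so only the
32-bit layout gives one array element per character.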

How does searching work now in this new world?
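
One possible starting point, sketched in Python under my own assumptions
(the helper names find_utf8 and byte_to_char_offset are hypothetical, and
this is not how the library currently searches): a plain byte-level
substring search works on a UTF-8 buffer, because a complete encoded
character can never match starting in the middle of another one. The catch
is that the hit comes back as a byte offset, not a character index.

```python
# Illustrative sketch: exact substring search directly on UTF-8 bytes.
# Both helpers are hypothetical names used only for this example.
def find_utf8(haystack: bytes, needle: str) -> int:
    """Return the byte offset of `needle` in a UTF-8 buffer, or -1."""
    return haystack.find(needle.encode("utf-8"))


def byte_to_char_offset(haystack: bytes, byte_offset: int) -> int:
    """Convert a byte offset on a character boundary to a character index."""
    return len(haystack[:byte_offset].decode("utf-8"))


# Unaccented Greek sample text (18 characters), stored as UTF-8 bytes.
text = ("\u03b5\u03bd \u03b1\u03c1\u03c7\u03b7 \u03b7\u03bd "
        "\u03bf \u03bb\u03bf\u03b3\u03bf\u03c2")
buf = text.encode("utf-8")

pos = find_utf8(buf, "\u03bb\u03bf\u03b3\u03bf\u03c2")
print(pos, byte_to_char_offset(buf, pos))  # prints "22 13"
```

This covers exact matches only. Case-insensitive search and matching
accented against unaccented forms would need extra handling that this
sketch does not attempt.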

Lots of things to consider over the next few weeks as we try to hash
out an initial shot at supporting this new range of modules.

		-Troy.

Martin Gruner wrote:
> 
> Hi Joachim,
> 
> > I think UTF-8 is a standard. Wouldn't it be better to have all modules
> > available in UTF-8 so all the fonts problems go away?
> 
> Yes and no. UTF-8 is just not necessary for the majority of modules. They
> will use twice the size since each character is 2 Byte. And there might be
> frontends which will not be able to display unicode at all. (e.g. irenaeus)
> 
> But: If the modules are encoded with the correct language specific encodings
> they are still 1 Byte, and it is just very easy to map these encodings into
> the UTF-8 unicode encoding. So we could internally work with unicode while
> other apps do not have to, and the modules are still small.
> The point is that the modules should be rebuilt using those iso8859-x
> encodings, which is _much_ better than just encoding with some fontspecific
> ascii encoding, which we can not map into unicode.
> 
> I wonder how searching in unicode modules works. Does sword now internally
> use unicode?
> 
> Martin
> 
> > > I favor moving from the font= tag to an encoding= tag. This way we'd not
> > > have to use huge fonts, but still the flexibility to let the user choose
> > > his/her font. E.g. encoding=iso8859-7 would define greek text. You can
> > > then just display this text with a 1 Byte iso8859-7 font or map it into
> > > unicode for different purposes.
> > > IMO using standards is always a good way to go.
> > > We could implement some mapping filters in sword which map from
> > > fontspecific ascii encodings to the correct language specific encodings
> > > (Like a bstgreek2iso8859-7 filter) to also support frontends favoring the
> > > font= solution.
> > >
> > > Some good links I want to recommend to you:
> > > http://czyborra.com/
> > > http://czyborra.com/charsets/iso8859.html
> > > http://czyborra.com/charsets/cyrillic.html
> > >
> > > Martin