[sword-devel] Unicode

Martin Gruner sword-devel@crosswire.org
Sun, 27 May 2001 19:35:01 +0200


> 1. Encode with UTF-8 whenever possible. (Probably a bad idea.)
> 2. Encode with ISO8859-1 (Latin-1) whenever possible and then UTF-8
> whenever possible if ISO8859-1 won't work, which alleviates the problem
> of accents & umlauts increasing in size.
> 3. Encode with all ISO8859 encodings and similar 8/16-bit encodings
> whenever possible, using UTF-8 as a fallback when possible, which
> alleviates many more module size problems.
>
> The question is how much processing we are willing to do in Sword to
> convert between encodings vs. how large we are willing to allow our
> modules to become.  One thing we have in our favor is that all of these
> modules can be targeted at Sword 1.5+, so we can compress them.  But a
> compressed UTF-8 NA27 is still going to be larger than a NA27 encoded in
> ISO8859-7.

A compressed NA27 in ISO8859-7 is the best option.
My proposal:
-store the modules in whatever encoding you like
-for every encoding, write an encoding->Unicode filter and a Unicode->encoding 
filter
-handle all strings as Unicode internally
-let the frontend decide which encoding to use for output (e.g. ISO8859-7 
vs. UTF-8)
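
A rough sketch of how this could look, in Python rather than Sword's C++ 
to keep it short. EncodingFilter and render are hypothetical names made up 
for illustration and are not part of the Sword API; Python's built-in codecs 
stand in for the per-encoding filters we would have to write ourselves.

```python
import codecs


class EncodingFilter:
    """Hypothetical filter pair for one storage encoding (not Sword API)."""

    def __init__(self, encoding: str):
        codecs.lookup(encoding)  # raises LookupError for an unknown encoding
        self.encoding = encoding

    def to_unicode(self, raw: bytes) -> str:
        # encoding -> Unicode: applied when text is read from a module
        return raw.decode(self.encoding)

    def from_unicode(self, text: str) -> bytes:
        # Unicode -> encoding: applied when text is handed to the frontend
        return text.encode(self.encoding)


def render(raw: bytes, module_encoding: str, output_encoding: str) -> bytes:
    """Convert module bytes to the encoding the frontend asked for.

    The text is always Unicode in between, so N encodings need N filter
    pairs instead of N*N direct converters.
    """
    text = EncodingFilter(module_encoding).to_unicode(raw)
    return EncodingFilter(output_encoding).from_unicode(text)


# Unaccented sample text; ISO-8859-7 covers monotonic Greek only.
greek = "εν αρχη ην ο λογος"
stored = greek.encode("iso-8859-7")               # as stored in the module
as_utf8 = render(stored, "iso-8859-7", "utf-8")   # frontend wants UTF-8
print(len(stored), len(as_utf8))
```

For this sample the ISO8859-7 form is 18 bytes and the UTF-8 form is 32 
bytes, since each Greek letter takes two bytes in UTF-8. That is the 
uncompressed difference only; the sketch says nothing about compressed sizes.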

> The nicest solution may be to allow flexibility for module makers and
> frontend makers by supporting texts encoded in UTF-8, ISO8859-x, etc.
> and translating to the desired encoding, just as we do with different
> markup filters.

Yes, as long as we use standardized encodings that do not depend on a 
special font.

> There's a further issue of Unicode's incompleteness.  Harry has
> mentioned there are still some issues with Hebrew support in Unicode
> 3.0.  There are very few fonts even made to support some of the new
> glyphs in Unicode 3.0.  As an example, while making a Peshitta module
> last night, I wanted to convert from a custom font encoding over to
> UTF-8.  Syriac was only added in Unicode 3.0, so I only found one font
> that supports its glyphs.  Even so, it appears that the Syriac
> implementation in Unicode 3.0 may be incomplete for the purposes of this
> text.

Well, this is a serious problem. For modules like this (no proper encoding 
available) it might be necessary to use a font-specific encoding 
(encoding=fontspecific) together with a specific font. I guess Unicode will 
take care of this in the future.

> Why does it seem that once we scale a tall mountain, we find an even
> taller mountain waiting behind it to be conquered as well?

Matthew 21:21
And Jesus in answer said to them, Truly I say to you, If you have faith, 
without doubting, not only may you do what has been done to the fig-tree, but 
even if you say to this mountain, Be taken up and put into the sea, it will 
be done.  ;)

Martin