[sword-devel] IBM's International Components for Unicode

Tue, 26 Jun 2001 16:38:08 -0700

When Bob suggested to me that we use ICU in Sword, my reaction was that it was just too big and didn't offer us enough to make it worth adding.  I think it deserves some further consideration though, and that we should consider adding it in 1.7 if not 1.5.3.

The encoding conversions aren't that important to me, except for downgrading to Latin-1, because I believe we should still keep the modules in UTF-8 (and eventually convert those that remain in other encodings to UTF-8).  But ICU has a lot of other things to offer us, the coolest of which (IMO) are locale information and transliteration.

You can see some of the locale info that ICU contains in its data files at http://oss.software.ibm.com/developerworks/opensource/icu/localeexplorer/.

But definitely check out the demo of their transliterator at http://oss.software.ibm.com/developerworks/opensource/icu/translitdemo.  It still needs some work, because it doesn't appear to handle pre-composed or combining characters.

Examples are the transliteration of the RST Genesis 1:1 from "В начале сотворил Бог небо и землю" to "V nachalè sotvorìl Bog nèbo ì zèmlù" and of the LXX Genesis 1:1 from "εν αρχη εποιησεν ο θεος τον ουρανον και την γην" to "en archē epoiēsen o theos ton ouranon kai tēn gēn".
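
To make the idea concrete, here is a rough Python sketch of table-driven transliteration for the unaccented LXX sample. It is not ICU's rule engine or its API: the GREEK_TO_LATIN table and the transliterate() function are hypothetical names made up for this illustration, and the table only covers unaccented lowercase letters.

```python
# Hypothetical mapping invented for this sketch; it is NOT ICU's rule data
# and covers only unaccented lowercase Greek letters.
GREEK_TO_LATIN = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "ē", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "u", "φ": "ph", "χ": "ch", "ψ": "ps",
    "ω": "ō",
}


def transliterate(text: str) -> str:
    """Replace each mapped Greek letter and pass everything else through."""
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(transliterate("εν αρχη εποιησεν ο θεος τον ουρανον και την γην"))
```

A plain lookup table like this has no notion of accents, breathing marks, or combining characters, which is exactly the kind of input a real transliterator has to cope with.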

It also claims to handle a lot of non-Roman scripts such as Katakana, so we might consider transliteration to Latin-1 as a means of supporting front-ends that can't handle UTF-8.
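
As a rough illustration of what a downgrade to Latin-1 could look like once text is already in a Latin script, here is a small Python sketch using only the standard unicodedata module. It is not how ICU does it: downgrade_to_latin1() and its fallback parameter are hypothetical names, and the strategy (keep what Latin-1 can encode, otherwise drop combining marks, otherwise substitute a placeholder) is just one assumed approach.

```python
import unicodedata


def downgrade_to_latin1(text: str, fallback: str = "?") -> bytes:
    """Best-effort conversion to Latin-1 (hypothetical helper, not ICU)."""
    out = []
    # Compose first so characters Latin-1 already has (e.g. "è") survive.
    for ch in unicodedata.normalize("NFC", text):
        try:
            out.append(ch.encode("latin-1"))
            continue
        except UnicodeEncodeError:
            pass
        # Decompose and drop combining marks, e.g. "ē" becomes "e".
        base = "".join(
            c for c in unicodedata.normalize("NFD", ch)
            if not unicodedata.combining(c)
        )
        try:
            out.append(base.encode("latin-1"))
        except UnicodeEncodeError:
            # Nothing usable is left (e.g. untransliterated Cyrillic).
            out.append(fallback.encode("latin-1"))
    return b"".join(out)


if __name__ == "__main__":
    print(downgrade_to_latin1("en archē epoiēsen"))
```

Text that has not been transliterated first (Cyrillic, Greek, Katakana) comes out as placeholders here, which is why the transliteration step would have to run before any such downgrade.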

--Chris