[sword-devel] ICU and internationalization in sword

Joel Mawhorter sword-devel@crosswire.org
Mon, 30 Sep 2002 15:40:49 -0700


Hi all,

I'm writing to get reactions to the idea of making sword dependent on ICU.
Currently we only have optional dependencies on ICU (at least for
transliteration, but I'm not sure what else). I would like to suggest making
ICU required for sword. The reason is that I would like to use functionality
in ICU in the searching and indexing code (and probably other things in the
future). Dealing with strings in a language-specific way is far from trivial
for many operations. For example, doing a search for whole words only (e.g.
searching for God doesn't return godly) isn't too hard just for English, but
doing it for all languages that are or can be supported by sword requires a
lot of special logic, since punctuation and even the concept of what a word
is vary so much from language to language. Either we use third-party code to
do this, or someone (possibly me) writes it specially for sword. I can't
speak for others, but I think that if I had to write code like this it would
likely not be as good as the ICU implementation. Another example of
something we need is case-insensitive searching. Currently this is done with
stristr(), which only handles ASCII. ICU allows this for any language
supported by Unicode. I have already concluded that index creation will need
to depend on ICU, since the hardest part of indexing is breaking up a text
into words, which is done differently from language to language.
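
To make the two search examples above concrete, here is a minimal sketch in
Python. It is an illustration only: it uses Python's standard library
(str.casefold and the re module) as a stand-in, not ICU and not sword's actual
code, and all function names are hypothetical. ascii_only_contains() imitates
a search that folds only A-Z, which is the limitation described for stristr().

```python
import re
import unicodedata


def ascii_only_contains(haystack: str, needle: str) -> bool:
    """Case-insensitive substring test that folds only ASCII A-Z.

    Hypothetical stand-in for an ASCII-only routine such as stristr().
    """
    def fold(s: str) -> str:
        return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

    return fold(needle) in fold(haystack)


def unicode_contains(haystack: str, needle: str) -> bool:
    """Case-insensitive substring test using Unicode case folding."""
    def fold(s: str) -> str:
        # Normalize first so composed and decomposed forms compare equal.
        return unicodedata.normalize("NFC", s).casefold()

    return fold(needle) in fold(haystack)


def whole_word_search(text: str, word: str) -> bool:
    """Whole-word, case-insensitive search using regex word boundaries.

    Assumption: words are delimited by non-word characters. That does not
    hold for scripts written without spaces between words.
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


if __name__ == "__main__":
    print(ascii_only_contains("In the beginning GOD created", "god"))
    print(ascii_only_contains("ΘΕΟΥ", "θεου"))
    print(unicode_contains("ΘΕΟΥ", "θεου"))
    print(whole_word_search("a godly man", "God"))
    print(whole_word_search("богатый человек", "Бог"))
```

The regex word boundary works here only because English and Russian separate
words with spaces and punctuation. For a script written without spaces, such
as Thai, this approach finds no usable boundaries, which is the kind of
language-specific logic the paragraph above argues should come from a library
rather than be rewritten for sword.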

Since ICU is well designed (IMO), open source, cross-platform and contains
about everything you could think of for Unicode string handling, the only
downside I can see to requiring it is the added size of the ICU libraries.
The default build of the three main ICU 2.2 libraries on my machine totals
about 14 MB, and they gzip to about 5.5 MB. For most platforms this is not a
significant size increase. Even when downloading over a modem, this doesn't
add too much download time. For platforms where size is a significant issue,
sword could be statically linked against ICU so that we only link in the
parts of ICU that we need.
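
As a rough check on the download-time remark, the arithmetic can be sketched
as below. The 5.5 MB figure is the gzipped size quoted above. The 56 kbit/s
and 33.6 kbit/s link speeds are assumptions chosen for illustration, and the
function name is hypothetical. Real modem throughput is usually lower than
the nominal rate, so treat the results as lower bounds.

```python
def download_minutes(size_mb: float, link_kbps: float) -> float:
    """Idealized transfer time in minutes, ignoring protocol overhead.

    size_mb   -- payload size in megabytes (1 MB = 1024 * 1024 bytes)
    link_kbps -- nominal link speed in kilobits per second (1 kbit = 1000 bits)
    """
    bits = size_mb * 1024 * 1024 * 8
    seconds = bits / (link_kbps * 1000)
    return seconds / 60


if __name__ == "__main__":
    for speed in (56, 33.6):  # assumed modem speeds, not from the post
        minutes = download_minutes(5.5, speed)
        print(f"{speed} kbit/s: about {minutes:.0f} minutes")
```

Under these assumptions the gzipped libraries add somewhere between a quarter
and half an hour to a modem download.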

I think that if we eventually want really good support for non-Latin-based
languages in sword, we will at some point have to start using a library
like ICU. I would rather do that now so that I don't have to write a bunch of 
code for searching that I will just throw out later. Another advantage of 
requiring ICU is that the front ends can start using it as well for 
internationalization of the user interfaces. What do you all think about 
this, especially regarding the advantages and disadvantages? Obviously Troy
has the final say on this one, but I thought an open discussion on this would
be good.

In Christ,

Joel Mawhorter