[sword-devel] DevTools:ICU & Normalization?

Wed Oct 12 08:48:28 MST 2011

David,

SWORD can link against many different versions of the ICU library. It
detects the version installed on the system and uses that version's
libraries. I know it works at least as far back as ICU 4.0, which
implements Unicode 5.1. It also compiles against ICU 4.8, which
implements Unicode 6.0. Whether it works with anything before ICU 4
I am not certain, as I have not tried earlier versions recently.

Whatever version is present on a system will be used. I thought
normalization was done at data retrieval time, which would mean the
version on the user's system applies. If it is done at import time,
then it is whatever version of ICU (and hence of Unicode) is on Chris
Little's system. I would imagine that it is 4.0 or later, as that
version dates to January 2009.

--Greg

On Wed, Oct 12, 2011 at 10:29 AM, David Haslam <dfhmch at googlemail.com> wrote:
> According to http://crosswire.org/wiki/DevTools:ICU - Sword makes use of ICU
> for casing (used in search), normalization, and script transliteration.
>
> *Which version of Unicode do we employ for normalization to NFC?*
>
> Some composite glyphs that use two combining characters in the *Myanmar*
> block are treated differently when specifying the current version of Unicode
> than they were for Unicode 3.2.
>
> These are the two combining characters, with code points U+1037 and U+103A:
>
> ့ MYANMAR SIGN DOT BELOW
> ် MYANMAR SIGN ASAT
>
> This pair of combining characters occurs many, many times in the BurJudson
> module.
>
> Software that includes normalization should be tested against the official
> Unicode Normalization Test
> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt (2.2MB) for that
> version of Unicode.
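
As a rough illustration of what such a test involves, here is a minimal
sketch in Python. It uses Python's standard unicodedata module, not ICU or
SWORD, so it only checks whatever Unicode version the Python build ships.
The helper names parse_columns and nfc_conforms are hypothetical, and the
sample line is hand-written in the file's five-column format (source; NFC;
NFD; NFKC; NFKD), not copied from NormalizationTest.txt. It handles data
lines only, not the @Part headers or blank lines.

```python
import unicodedata

def parse_columns(line):
    """Split one data line into five strings (c1..c5), dropping the comment."""
    data = line.split("#", 1)[0]
    fields = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in f.split()) for f in fields]

def nfc_conforms(line):
    """NFC invariants: c2 == NFC(c1) == NFC(c2) == NFC(c3); c4 == NFC(c4) == NFC(c5)."""
    c1, c2, c3, c4, c5 = parse_columns(line)
    nfc = lambda s: unicodedata.normalize("NFC", s)
    return (c2 == nfc(c1) == nfc(c2) == nfc(c3)) and (c4 == nfc(c4) == nfc(c5))

# Hand-written sample built from the Myanmar sequence discussed in this thread.
sample = ("1000 103A 1037;1000 1037 103A;1000 1037 103A;"
          "1000 1037 103A;1000 1037 103A; # hypothetical sample line")
# Same source, but with an NFC column that was left unreordered.
bad = ("1000 103A 1037;1000 103A 1037;1000 1037 103A;"
       "1000 1037 103A;1000 1037 103A; # hypothetical failing line")

print(nfc_conforms(sample), nfc_conforms(bad))
```

A full run would loop over every data line of the file that matches the
Unicode version of the library under test.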
>
> Testing the normalization of the sequence U+1000 U+103A U+1037 with the ICU
> Normalization Browser (which uses the "Internationalization Components for
> Unicode" library, which is the most widely used Unicode software library),
> we can verify that it does indeed normalize to U+1000 U+1037 U+103A, with
> reordering:
>
> See http://bit.ly/nqYzQp.
>
> However, if you run the same test for Unicode 3.2 (released March 2002, and
> so almost 10 years out of date), there is no reordering:
>
> See http://bit.ly/orZ7df.
>
> /NB. I used the URL shortener to allow parameters to be passed to the test
> page more easily/.
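
One likely reason for the Unicode 3.2 result is that U+103A was only added
to the standard in Unicode 5.1, so under 3.2 it is an unassigned code point
with combining class 0 and nothing is reordered around it. A small sketch
to inspect this, assuming CPython's unicodedata.ucd_3_2_0 object (a Unicode
3.2 snapshot kept in the standard library; this is not ICU). It only looks
at character properties and makes no claim about how that object's
normalize() behaves.

```python
import unicodedata

old = unicodedata.ucd_3_2_0   # Unicode 3.2 snapshot shipped with CPython

for ch in ("\u1037", "\u103A"):
    print("U+%04X  3.2: category=%s ccc=%d   current: category=%s ccc=%d" % (
        ord(ch), old.category(ch), old.combining(ch),
        unicodedata.category(ch), unicodedata.combining(ch)))

# "Cn" is the general category reported for unassigned code points.
asat_assigned_in_3_2 = old.category("\u103A") != "Cn"
print("U+103A assigned in Unicode 3.2:", asat_assigned_in_3_2)
```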
>
> The process of converting a string to NFC or NFD requires a stage called
> "canonical ordering", whereby characters are reordered in ascending order
> according to their canonical combining class [ccc]. See
> http://www.unicode.org/reports/tr15/?win#Description_Norm.
>
> U+103A MYANMAR SIGN ASAT has ccc=9, whereas U+1037 MYANMAR SIGN DOT BELOW
> has ccc=7; therefore U+1037 is reordered before U+103A.
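
The reordering described above can be reproduced outside SWORD and ICU.
This is a minimal sketch using Python's standard unicodedata module (an
assumption for illustration; the Unicode version it implements depends on
the Python build). The helper canonical_order is hypothetical and performs
only the ordering step; it does not decompose or recompose, which is
sufficient here because none of the three characters has a decomposition.

```python
import unicodedata

def canonical_order(text):
    """Stable-sort each run of non-starters (ccc != 0) by combining class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A starter ends the current run of combining marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

KA, DOT_BELOW, ASAT = "\u1000", "\u1037", "\u103A"
source = KA + ASAT + DOT_BELOW

print("Unicode data version:", unicodedata.unidata_version)
for ch in (DOT_BELOW, ASAT):
    print("U+%04X ccc=%d" % (ord(ch), unicodedata.combining(ch)))

reordered = canonical_order(source)
nfc = unicodedata.normalize("NFC", source)
print(" ".join("U+%04X" % ord(c) for c in nfc))
```

Comparing the output with that of the ICU build SWORD links against would
show whether the two agree for this sequence.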
>
> The present module BurJudson has SwordVersionDate=2008-03-01.
> It looks very much as if the normalization was done according to Unicode
> 3.2.
>
> Context:
> This question arises in the context of the possibility of creating a new
> module from a better source text.
> If we use the latest SWORD utilities to make the new module, will it
> normalize correctly?
>
> David