[sword-devel] NFC Normalization and osis2mod

Fri Feb 22 07:27:56 MST 2008

Chris,
    That's fine. I thought that when you suggested that I look at the 
icu website to see how things had changed/improved that this was what 
you were suggesting. So, I went with the OO example rather than 
debugging the existing code. Just let me know if you are offering to 
make the change or whether you are just looking. Here is what I saw 
before I decided to go OO.

The thing I noticed in Sword's ICU filters is that it was not consistent 
in how it set up the UChar array or converted that back to a SWBuf.

The setup may be wrong:
        int32_t len = text.length() * 2;
        source = new UChar[len + 1];
        len = ucnv_toUChars(conv, source, len, text.c_str(), -1, &err);

The ICU documentation does not give the -1 as a value for the input 
length. However, this is used in all the other filters.
One of the filters computes len using strlen(text), which should return 
the same as text.length().
Many of the filters just use text.length(), one uses text.length()*2+1, 
another 5+text.length()*5 and only this one uses text.length()*2.

I'm not at all sure, but my guess is that sizeof(UChar) is 2 bytes and 
sizeof(char) is 1 byte. I'd guess that *2 and *5 allow for maximal 
expansion. The +1 probably could be +1000.

One of Sword ICU filters stuffs a null with:
source[len]=0;
But this one does not. From reading the ICU docs, this is not harmful 
nor necessary.

 From what I can tell the call:
    unorm_normalize(source, len, UNORM_NFC, 0, target, len, &err);
is interesting in a couple of ways. It passes in len rather than len+1 
for the target. So the call cannot null terminate the buffer unless it 
results in fewer UChars than the length of the buffer. I think that 
passing len+1, the actual length, for source is not a problem.

At the end we had:
           text.setSize(text.size()*2);
           len = ucnv_fromUChars(conv, text.getRawData(), text.size(), 
target, -1, &err);
           text.setSize(len);

In terms of what I saw, none of the corrupted text was too short. This 
leads me to believe that len, returned from ucnv_fromUChars was wrong. 
The icu documentation does not document a -1 as possible argument for 
the target length. This pattern is not used in the other Sword ICU 
filters. All of them pass the length of the target, with either 
target.length() or the return from a transformation.

Based upon this analysis, I figured that OO's penchant for hiding the 
complexities of proper allocation was worth the change. And having yet 
another way was no big deal, because no two filters did the same 
marshalling to and from swbuf.

DM
Chris Little wrote:
> I'd prefer we don't completely rewrite the NFC filter, like this patch 
> does. I realize the ICU tutorials demonstrate the C++ interface, but I'm 
> pretty certain it's still just a wrapper around the C interfaces that we 
> were using.
>
> I'll take a look at the filters today.
>
> --Chris
>
>
> Troy A. Griffitts wrote:
>   
>> DM,
>>
>> The patch looks good to me.
>>
>> DM Smith wrote:
>>     
>>> I've added a -n flag to osis2mod that will normalize UTF-8 to NFC, which 
>>> we've agreed as the standard for UTF-8 modules.
>>>
>>> I used Sword's UTF8NFC filter to do the work, but found that it was 
>>> buggy with trailing garbage on some verses.
>>>
>>> I have created a patch for both 
>>> at www.crosswire.org/~dmsmith/nfcPatch.txt 
>>> <http://www.crosswire.org/~dmsmith/nfcPatch.txt> and would greatly 
>>> appreciate some more testing of it.
>>>
>>> My test was fairly trivial. I took an OSIS file with limited UTF-8, 
>>> already nfc and ran it through osis2mod with and without the -n flag and 
>>> then compared the two files. Before I fixed UTF8NFC there were 
>>> differences. After fixing UTF8NFC, there were none.
>>>
>>> All that this shows is that it does not corrupt an already good nfc 
>>> utf-8 file.
>>>
>>> Many thanks in advance.
>>>
>>> DM
>>>