[sword-devel] NFC Normalization and osis2mod
dmsmith555 at yahoo.com
Fri Feb 22 07:27:56 MST 2008
That's fine. I thought that when you suggested that I look at the
icu website to see how things had changed/improved that this was what
you were suggesting. So, I went with the OO example rather than
debugging the existing code. Just let me know if you are offering to
make the change or whether you are just looking. Here is what I saw
before I decided to go OO.
The thing I noticed in Sword's ICU filters is that it was not consistent
in how it set up the UChar array or converted that back to a SWBuf.
The setup may be wrong:
int32_t len = text.length() * 2;
source = new UChar[len + 1];
len = ucnv_toUChars(conv, source, len, text.c_str(), -1, &err);
The ICU documentation does not give the -1 as a value for the input
length. However, this is used in all the other filters.
One of the filters computes len using strlen(text), which should return
the same as text.length().
Many of the filters just use text.length(), one uses text.length()*2+1,
another 5+text.length()*5 and only this one uses text.length()*2.
I'm not at all sure, but my guess is that sizeof(UChar) is 2 bytes and
sizeof(char) is 1 byte. I'd guess that *2 and *5 allow for maximal
expansion. The +1 probably could be +1000.
One of Sword ICU filters stuffs a null with:
But this one does not. From reading the ICU docs, this is not harmful
From what I can tell the call:
unorm_normalize(source, len, UNORM_NFC, 0, target, len, &err);
is interesting in a couple of ways. It passes in len rather than len+1
for the target. So the call cannot null terminate the buffer unless it
results in fewer UChars than the length of the buffer. I think that
passing len+1, the actual length, for source is not a problem.
At the end we had:
len = ucnv_fromUChars(conv, text.getRawData(), text.size(),
target, -1, &err);
In terms of what I saw, none of the corrupted text was too short. This
leads me to believe that len, returned from ucnv_fromUChars was wrong.
The icu documentation does not document a -1 as possible argument for
the target length. This pattern is not used in the other Sword ICU
filters. All of them pass the length of the target, with either
target.length() or the return from a transformation.
Based upon this analysis, I figured that OO's penchant for hiding the
complexities of proper allocation was worth the change. And having yet
another way was no big deal, because no two filters did the same
marshalling to and from swbuf.
Chris Little wrote:
> I'd prefer we don't completely rewrite the NFC filter, like this patch
> does. I realize the ICU tutorials demonstrate the C++ interface, but I'm
> pretty certain it's still just a wrapper around the C interfaces that we
> were using.
> I'll take a look at the filters today.
> Troy A. Griffitts wrote:
>> The patch looks good to me.
>> DM Smith wrote:
>>> I've added a -n flag to osis2mod that will normalize UTF-8 to NFC, which
>>> we've agreed as the standard for UTF-8 modules.
>>> I used Sword's UTF8NFC filter to do the work, but found that it was
>>> buggy with trailing garbage on some verses.
>>> I have created a patch for both
>>> at www.crosswire.org/~dmsmith/nfcPatch.txt
>>> <http://www.crosswire.org/~dmsmith/nfcPatch.txt> and would greatly
>>> appreciate some more testing of it.
>>> My test was fairly trivial. I took an OSIS file with limited UTF-8,
>>> already nfc and ran it through osis2mod with and without the -n flag and
>>> then compared the two files. Before I fixed UTF8NFC there were
>>> differences. After fixing UTF8NFC, there were none.
>>> All that this shows is that it does not corrupt an already good nfc
>>> utf-8 file.
>>> Many thanks in advance.
More information about the sword-devel