[sword-devel] NFC Normalization and osis2mod

Chris Little chrislit at crosswire.org
Sat Feb 23 05:46:27 MST 2008

DM Smith wrote:
> The thing I noticed in Sword's ICU filters is that it was not consistent 
> in how it set up the UChar array or converted that back to a SWBuf.

Thanks for digging through everything. I will see if I can't make things 
a little more consistent once I get UTF8NFC debugged.

> The setup may be wrong:
>         int32_t len = text.length() * 2;
>         source = new UChar[len + 1];
>         len = ucnv_toUChars(conv, source, len, text.c_str(), -1, &err);

Yes, that's where I'm focusing my attention.

> Many of the filters just use text.length(), one uses text.length()*2+1, 
> another 5+text.length()*5 and only this one uses text.length()*2.

Well, here are some guesses as to what these might have come from....

X+1 is probably making room for a null termination (probably unnecessary 
since everything is null terminated to begin with).

X*2 could be either doubling the byte size to accommodate conversion 
from 8-bit chars to 16-bit chars OR could be acceptance of the fact that 
characters we encounter might actually be represented as surrogate pairs 
in UTF-16. (ICU uses UTF-16 internally.)
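
To put rough numbers on the X*2 guess, here is a minimal sketch that 
counts how many UTF-16 code units a UTF-8 string needs. It is not taken 
from the SWORD filters; the helper name utf16UnitsForUtf8 is made up for 
illustration, and it assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the UTF-16 code units needed to hold a
// UTF-8 string. Assumes well-formed UTF-8 input.
std::size_t utf16UnitsForUtf8(const std::string &utf8) {
    std::size_t units = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) == 0x80) {
            continue;  // continuation byte (10xxxxxx): part of a character already counted
        }
        // A lead byte of 0xF0 or above starts a 4-byte sequence, which
        // becomes a surrogate pair (2 units); everything else is 1 unit.
        units += (byte >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

Under that assumption the count never exceeds the number of UTF-8 bytes: 
a 1- to 3-byte sequence becomes one UChar and a 4-byte sequence becomes 
two. So for well-formed input, text.length() UChars (plus one if a 
terminator is wanted) should already be enough, and doubling is only 
extra headroom.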

X*5 is probably allowing for expansion from a character to its UTF-8 
representation, which is at most 4 bytes long in current UTF-8 (the 
original definition allowed up to 6), so a factor of 5 is more than 
enough.
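
The expansion in the other direction can be checked the same way. The 
sketch below is an illustration only, not code from the SWORD filters; 
the names utf8Length and utf8BytesForUtf16 are hypothetical, and it 
assumes well-formed UTF-16 and the code point range up to U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes needed to encode one code point in UTF-8,
// assuming the range up to U+10FFFF.
std::size_t utf8Length(char32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Hypothetical helper: UTF-8 bytes needed for a UTF-16 string.
// Assumes well-formed UTF-16 (every high surrogate is followed by a low one).
std::size_t utf8BytesForUtf16(const std::u16string &utf16) {
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < utf16.size(); ++i) {
        char16_t unit = utf16[i];
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < utf16.size()) {
            bytes += 4;  // surrogate pair: two units become one 4-byte sequence
            ++i;         // skip the low surrogate
        } else {
            bytes += utf8Length(unit);  // BMP character: 1 to 3 bytes
        }
    }
    return bytes;
}
```

Under those assumptions the worst case is 3 bytes per UTF-16 code unit 
(a surrogate pair spends two units on 4 bytes), so a buffer of 
3 * length + 1 bytes would cover the conversion back to UTF-8.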

I'll get it all sorted out eventually, but those are what those numbers 
probably represent.

I had a bit of difficulty getting BCB5 installed and working in Vista, 
but I think I've got everything running well enough for the moment so 
that I can get to work on this.

