[sword-devel] NFC Normalization and osis2mod
chrislit at crosswire.org
Sat Feb 23 05:46:27 MST 2008
DM Smith wrote:
> The thing I noticed in Sword's ICU filters is that it was not consistent
> in how it set up the UChar array or converted that back to a SWBuf.
Thanks for digging through everything. I will see if I can't make things
a little more consistent once I get UTF8NFC debugged.
> The setup may be wrong:
> int32_t len = text.length() * 2;
> source = new UChar[len + 1];
> len = ucnv_toUChars(conv, source, len, text.c_str(), -1, &err);
Yes, that's where I'm focusing my attention.
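One way to sanity-check the size of that buffer without ICU: a well-formed UTF-8 string never needs more UTF-16 code units than it has bytes, since 1-, 2- and 3-byte sequences each become one unit and 4-byte sequences become a surrogate pair (two units). The sketch below is plain C++, not Sword or ICU code. `utf16UnitsNeeded` is a name made up for illustration, and it assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Sword or ICU): counts the UTF-16 code
// units needed to hold a well-formed UTF-8 string, excluding any terminator.
std::size_t utf16UnitsNeeded(const std::string &utf8) {
    std::size_t units = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) == 0x80) {
            continue;                  // continuation byte: part of a previous character
        }
        units += (c >= 0xF0) ? 2 : 1;  // 4-byte lead byte -> surrogate pair, else one unit
    }
    return units;
}
```

If that reasoning holds, text.length() + 1 UChars would be enough for the conversion itself, and the *2 only adds headroom. Whether the normalization step afterwards needs more room is a separate question.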
> Many of the filters just use text.length(), one uses text.length()*2+1,
> another 5+text.length()*5 and only this one uses text.length()*2.
Well, here are some guesses as to what these might have come from....
X+1 is probably making room for a null termination (probably unnecessary
since everything is null terminated to begin with).
X*2 could be either doubling the byte size to accommodate conversion
from 8-bit chars to 16-bit chars OR could be acceptance of the fact that
characters we encounter might actually be represented as surrogate pairs
in UTF-16. (ICU uses UTF-16 internally.)
X*5 is probably allowing for expansion from a character to its UTF-8
representation, although UTF-8 as currently defined (RFC 3629) needs at
most 4 bytes per character; the original scheme allowed up to 6.
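For the trip back from UChar to UTF-8, the expansion can be worked out per code point. The sketch below is plain C++ rather than Sword or ICU code, and `utf8Length` and `utf16Length` are names made up for illustration. It suggests that the worst case is 3 bytes per UTF-16 code unit: a BMP character above U+07FF takes one unit and 3 bytes, while a supplementary character takes two units and 4 bytes.

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed to encode one Unicode code point
// (up to U+10FFFF) in UTF-8 as defined by RFC 3629.
std::size_t utf8Length(unsigned long cp) {
    if (cp < 0x80) return 1;       // ASCII
    if (cp < 0x800) return 2;      // e.g. Latin with diacritics, Greek, Hebrew
    if (cp < 0x10000) return 3;    // rest of the Basic Multilingual Plane
    return 4;                      // supplementary planes
}

// Hypothetical helper: UTF-16 code units needed for one code point.
std::size_t utf16Length(unsigned long cp) {
    return cp < 0x10000 ? 1 : 2;   // supplementary planes use a surrogate pair
}
```

On that basis a destination of 3 * len + 1 bytes, with len counted in UChars, should cover any UTF-16 to UTF-8 conversion, so X*5 would be more than enough.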
I'll get it all sorted out eventually, but those are my best guesses at
what those numbers mean.
I had a bit of difficulty getting BCB5 installed and working in Vista,
but I think I've got everything running well enough for the moment so
that I can get to work on this.