[sword-devel] XML Numeric character references (entities) in BibleCS

Chris Little chrislit at crosswire.org
Thu Jan 31 17:37:01 MST 2008


On Jan 31, 2008, at 3:08 PM, DM Smith wrote:
> I imagine there is a C/C++ routine that will convert from an entities
> codepoint to a UTF-8 Character.

The numeric entities can presumably be interpreted as UTF-32 code
points and encoded as UTF-8 on that basis, using either ICU's routines
or those in Sword. The one hangup might be if someone encodes UTF-16
surrogate pairs as entities. I'm not even sure whether that is legal,
much less how likely anyone would be to do it.
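
A minimal sketch of the conversion described above, not the actual ICU or
Sword code: the function names here (encodeUtf8, combineSurrogates, and the
two surrogate tests) are hypothetical helpers made up for illustration. It
treats the number from the entity as a code point, encodes it as UTF-8, and
shows how a high/low surrogate pair could be folded back into a single code
point first. As far as I can tell from the XML 1.0 Char production, a
reference to a lone surrogate value is not well-formed, so the sketch
rejects those instead of emitting bytes for them.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helpers (not part of Sword or ICU): classify UTF-16 surrogates.
inline bool isHighSurrogate(std::uint32_t cp) { return cp >= 0xD800 && cp <= 0xDBFF; }
inline bool isLowSurrogate(std::uint32_t cp)  { return cp >= 0xDC00 && cp <= 0xDFFF; }

// Combine a high/low surrogate pair into one supplementary code point.
// The caller is expected to have checked both halves with the tests above.
inline std::uint32_t combineSurrogates(std::uint32_t hi, std::uint32_t lo) {
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}

// Encode a single code point as UTF-8.
// Returns an empty string for a surrogate value or anything above U+10FFFF.
inline std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;  // not a valid Unicode scalar value
    }
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

If a pair of entities did turn up as surrogates, the idea would be to check
them with isHighSurrogate/isLowSurrogate, pass them through
combineSurrogates, and hand the result to encodeUtf8.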

> I'm working on adding -n to osis2mod that will normalize UTF-8 to NFC.
> There's a bug in it and I'll be posting separately about it.

Are you using ICU? There's code in utf8nfc.cpp (in the filters
directory) that should work to do the normalization. We might even be
able to use ICU to solve the surrogates issue with a little work.

--Chris



