[sword-devel] XML Numeric character references (entities) in BibleCS

DM Smith dmsmith555 at yahoo.com
Thu Jan 31 16:08:32 MST 2008


On Jan 31, 2008, at 5:16 PM, Chris Little wrote:

>
> On Jan 31, 2008, at 1:29 PM, Benny Wasty wrote:
>
>> Hello,
>>
>> I noticed that BibleCS doesn't seem to be able to display Unicode
>> characters encoded as numeric character references (e.g. &#246; for ö)
>> in an OSIS module I am currently working on. The characters are just
>> omitted.
>> I guess they should be displayed correctly, as this is a "basic" XML
>> feature as far as I know.
>> BibleDesktop shows them by the way.
>
> Correct, Sword does not handle numbered entities. I don't think we
> want to add support for them at runtime either, because doing so would
> 1) waste processor time in converting to UTF-8 and 2) waste a lot of
> storage space compared to UTF-8. I will, however, add a todo to the bug
> tracker to do conversion to UTF-8 during import.

While we might not want to support entities at runtime, we can improve
osis2mod to flag them with a warning, convert them, or die with a
fatal error.
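
To make the warn and die options concrete, here is a rough sketch of
a scanner for numeric character references. It is not osis2mod code:
RefPolicy, numericRefLength and checkNumericRefs are hypothetical
names invented for illustration, and conversion is left out here.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical policy type; not an existing osis2mod option.
enum class RefPolicy { Warn, Fatal };

// If a numeric character reference (&#246; or &#xF6;) starts at pos,
// returns its length in bytes including the ';'. Otherwise returns 0.
std::size_t numericRefLength(const std::string& s, std::size_t pos) {
    assert(pos <= s.size());
    if (s.compare(pos, 2, "&#") != 0) return 0;
    std::size_t i = pos + 2;
    const bool hex = i < s.size() && s[i] == 'x';  // XML allows lowercase x only
    if (hex) ++i;
    const std::size_t firstDigit = i;
    while (i < s.size() &&
           (hex ? std::isxdigit(static_cast<unsigned char>(s[i]))
                : std::isdigit(static_cast<unsigned char>(s[i])))) {
        ++i;
    }
    if (i == firstDigit || i >= s.size() || s[i] != ';') return 0;
    return i + 1 - pos;
}

// Scans text for numeric character references. Under Warn, logs each
// one and returns how many were found. Under Fatal, throws on the first.
std::size_t checkNumericRefs(const std::string& text, RefPolicy policy,
                             std::ostream& log = std::cerr) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size();) {
        const std::size_t len = numericRefLength(text, i);
        if (len == 0) { ++i; continue; }
        const std::string ref = text.substr(i, len);
        if (policy == RefPolicy::Fatal) {
            throw std::runtime_error("numeric character reference in input: " + ref);
        }
        log << "warning: numeric character reference " << ref
            << " at byte offset " << i << '\n';
        ++count;
        i += len;
    }
    return count;
}
```

Named entities such as &amp; are deliberately ignored, since only the
numbered form is at issue here.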

I imagine there is a C/C++ routine that will convert an entity's
code point to a UTF-8 character.
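
For reference, the conversion is small enough to write by hand. The
following is a minimal sketch, not code from Sword or osis2mod:
appendUtf8, digitValue and convertNumericRefs are hypothetical names.
It assumes the input is otherwise valid UTF-8 and does not enforce
XML's narrower rules on which characters a reference may name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends the UTF-8 encoding of one Unicode scalar value to out.
// Precondition: cp <= 0x10FFFF and cp is not a surrogate (0xD800-0xDFFF).
void appendUtf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Returns the value of digit c in the given base (10 or 16), or -1.
int digitValue(char c, int base) {
    if (c >= '0' && c <= '9') return c - '0';
    if (base == 16 && c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (base == 16 && c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Replaces decimal (&#246;) and hexadecimal (&#xF6;) references with
// UTF-8. Malformed or out-of-range references are copied unchanged.
std::string convertNumericRefs(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in.compare(i, 2, "&#") == 0) {
            std::size_t j = i + 2;
            int base = 10;
            if (j < in.size() && in[j] == 'x') { base = 16; ++j; }
            std::uint32_t cp = 0;
            std::size_t digits = 0;
            for (; j < in.size(); ++j, ++digits) {
                const int d = digitValue(in[j], base);
                if (d < 0) break;
                // Stop accumulating once out of range, to avoid overflow.
                if (cp <= 0x10FFFF) cp = cp * base + static_cast<std::uint32_t>(d);
            }
            const bool scalar = cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
            if (digits > 0 && j < in.size() && in[j] == ';' && scalar) {
                appendUtf8(out, cp);
                i = j + 1;
                continue;
            }
        }
        out += in[i];
        ++i;
    }
    return out;
}
```

The output is not guaranteed to be NFC, so normalization would still
have to run after this step.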


> All data in modules is assumed to be NFC normalized UTF-8.

Correct me if I am wrong. The significance of this is that entities  
are not part of NFC UTF-8.

I'm working on adding -n to osis2mod that will normalize UTF-8 to NFC.  
There's a bug in it and I'll be posting separately about it.


>
>
> I haven't looked at the code or tested this, but I would be willing to
> bet BibleDesktop is displaying your characters correctly but wouldn't
> match them in a search.

You are right about search. BD and any of the other frontends (Bible
Tool, MacSword, GnomeSword) that display HTML would handle them
properly, but none of our index code converts entities to characters.

In Him,
	DM




