[sword-devel] osis2mod change

DM Smith dmsmith555 at yahoo.com
Sun Feb 24 15:25:21 MST 2008

On Feb 24, 2008, at 4:46 PM, Chris Little wrote:

> DM Smith wrote:
>> I have added a -n flag to osis2mod.
> I'm going to add it to the other major importers (osis2gbs & imp2*) just
> as soon as I get things into a fairly stable state.
>> To be usable, this flag requires osis2mod to be compiled with ICU
>> support.
>> -n stands for "normalize to NFC", the agreed-upon normalization form
>> for UTF-8 text.
>> When should this flag be used?
>> 1) When the input is UTF-8
>> and
>> 2) It is not known to be NFC
> First, I feel like there's really no reason NOT to perform
> normalization, provided that the input is UTF-8. Even if the input is
> already in NFC, it won't hurt anything to do it again. It will take
> extra time to compile the module, but I feel like it's better to be safe
> than sorry in this case.

I mostly agree. But once I know that the module is NFC, I'd rather not  
take the hit. I must have made the KJV into a module 100 or more times  
before I got it right.
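
A minimal sketch of the "skip it when the text is already NFC" idea, written in Python purely for illustration (osis2mod itself would do this through ICU). The helper name to_nfc is hypothetical, and unicodedata.is_normalized assumes Python 3.8 or later. The check is cheaper than a full normalization pass, so input that is already NFC comes back untouched.

```python
import unicodedata
from typing import Tuple


def to_nfc(text: str) -> Tuple[str, bool]:
    """Return (text in NFC, whether anything had to change).

    Hypothetical helper: checks first, and only normalizes when needed.
    """
    if unicodedata.is_normalized("NFC", text):
        # Already NFC: hand the input back without rewriting it.
        return text, False
    return unicodedata.normalize("NFC", text), True


# 'e' followed by COMBINING ACUTE ACCENT is not in NFC form.
decomposed = "e\u0301"
composed, changed = to_nfc(decomposed)
```

Running the same helper again on its own output reports no change, which is the case where the extra cost can be avoided.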

> Second, your comment about needing UTF-8 input makes me think we should
> go ahead and add encoding conversion to the importers as well, possibly
> with automatic charset detection.

I'd like to see OSIS modules also be UTF-8.

What mechanism were you thinking of for automatic charset detection? I
have a buggy routine that detects whether something is UTF-8, 7-bit
ASCII, or something else. We could use that (once I fix it).
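
To make the three-way split concrete, here is a rough Python sketch. It is not the routine mentioned above; classify_encoding is a hypothetical name, and the approach (a strict UTF-8 decode attempt) is an assumption about one reasonable way to do it. It can only say that bytes are valid UTF-8, not that UTF-8 was what the author intended.

```python
def classify_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'other' (hypothetical helper)."""
    # Pure 7-bit input is valid ASCII, and therefore also valid UTF-8.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    try:
        # A strict decode rejects malformed or truncated multi-byte sequences.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "other"
    return "utf-8"


plain = classify_encoding(b"In the beginning")
utf8 = classify_encoding("caf\u00e9".encode("utf-8"))
legacy = classify_encoding(b"caf\xe9")  # 0xE9 is e-acute in cp1252
```

Anything reported as "other" would still need a declared or guessed legacy charset before it could be converted.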

As to automatic charset detection, could we require that every input
to osis2mod start with an XML declaration, such as
<?xml version="1.0" encoding="UTF-8"?>
or
<?xml version="1.0" encoding="cp1252"?>
and use whatever value the encoding attribute gives?
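
Reading the declared encoding could look roughly like the Python sketch below. The names declared_encoding and decode_input are hypothetical, the regular expression is a simplification of the XML declaration grammar, and byte order marks and UTF-16 input are deliberately not handled. Falling back to UTF-8 when no declaration is present is an assumption, not something agreed in this thread.

```python
import re
from typing import Optional

# Simplified match for an XML declaration at the very start of the input.
_XML_DECL = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None if absent."""
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else None


def decode_input(data: bytes, default: str = "utf-8") -> str:
    """Decode using the declared encoding, else the assumed default."""
    return data.decode(declared_encoding(data) or default)


sample = b'<?xml version="1.0" encoding="cp1252"?>\n<osis>caf\xe9</osis>'
name = declared_encoding(sample)
text = decode_input(sample)
```

Once the text is decoded this way, it can be written back out as UTF-8 and normalized like any other input.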

-- DM
