[sword-devel] osis2mod change

Fri Feb 29 13:41:46 MST 2008

Chris Little wrote:
> DM Smith wrote:
>   
>> I mostly agree. But once I know that the module is NFC, I'd rather not
>> take the hit. I must have made the KJV into a module 100 or more times 
>> before I got it right.
>>     
>
> What would you think of making normalization the default and using a 
> switch to turn it off? It doesn't particularly matter for me, but I'm 
> thinking of a complete newbie trying to make a module. The defaults 
> should be as general purpose and common as possible. Then again, since 
> we build from source for our releases (not from submitted compiled 
> modules), perhaps it doesn't matter either way.
>   

I made the change. The flag used to be -n, now it is -N.

>   
>>> Second, your comment about needing UTF-8 input makes me think we should
>>> go ahead and add encoding conversion to the importers as well, possibly
>>> with automatic charset detection.
>>>       
>> I'd like to see OSIS modules also be UTF-8.
>>
>> What mechanism were you thinking of for automatic charset detection? I 
>> have a buggy routine to detect whether something is UTF-8, 7-bit ascii 
>> or other. We could use that (once I fix it).
>>
>> As to automatic charset detection, could we require that every input to 
>> osis2mod have:
>> <?xml version="1.0" encoding="UTF-8"?>
>> or
>> <?xml version="1.0" encoding="cp1252"?>
>> and use whatever is the value for the encoding attribute?
>>     
>
> I planned to just use ICU's charset detection. It takes a bunch of text, 
> runs some heuristic algorithms on it, and uses that to guess the 
> charset. It supports most of the common standard charsets.

I added UTF-8 detection, plus conversion to UTF-8 from cp1252/Latin-1 
using Sword's Latin1UTF8 filter. I did some reading on the ICU 
converters, and they need a decent-sized sample to do the detection. Also, 
windows-1252 was not in the list of supported encodings. Many of 
our non-ASCII Latin-1 texts have only the cp1252 quotation marks and 
such, so it would probably take some significant testing to see whether 
ICU would detect cp1252 as ISO-8859-1.
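
To illustrate the detect-then-convert idea described above, here is a minimal sketch in Python. It is not the osis2mod code and does not use Sword's Latin1UTF8 filter; the function name to_utf8 is hypothetical, and the sketch assumes that any input which fails strict UTF-8 validation can be treated as cp1252.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw unchanged if it is valid UTF-8, else convert it from cp1252.

    Hypothetical helper for illustration only; assumes non-UTF-8 input is cp1252.
    """
    try:
        # A strict decode doubles as a UTF-8 validity check.
        # Pure 7-bit ASCII also passes, since ASCII is a subset of UTF-8.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined; errors="replace"
        # maps those to U+FFFD instead of raising.
        return raw.decode("cp1252", errors="replace").encode("utf-8")
```

With this approach the cp1252 curly quotes (bytes 0x93 and 0x94) come out as U+201C and U+201D, whereas a plain ISO-8859-1 conversion would map them to C1 control characters. That is why the distinction between cp1252 and ISO-8859-1 matters for texts like ours.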

As for the impact of these changes: on a text (the KJV) that is mostly 
ASCII with a little bit of UTF-8 (already NFC), a run with -N took 14 
seconds to process. Without the flag and with no detection it took 16 
seconds. With detection, and normalization when necessary, it took 15 
seconds. So the impact seems to be small.
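
A rough sketch of what "normalization when necessary" could look like, again in Python rather than the actual osis2mod code. It assumes "when necessary" means skipping text that is already NFC; the function name nfc_if_needed is hypothetical, and unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata


def nfc_if_needed(text: str) -> str:
    """Normalize text to NFC only when it is not already in NFC.

    Hypothetical helper for illustration only.
    """
    # The check is cheaper than a full normalization pass for text
    # that is already NFC, which covers mostly-ASCII input.
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)
```

For example, "e" followed by U+0301 (combining acute accent) is composed into the single code point U+00E9, while plain ASCII text is returned untouched.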

Note, the -N flag is necessary to preserve Latin-1.

If this is not what you wanted, please feel free to revert it.

In Him,
    DM