[sword-devel] osis2mod change

DM Smith dmsmith555 at yahoo.com
Tue Feb 26 05:52:34 MST 2008

On Feb 26, 2008, at 12:03 AM, Chris Little wrote:

> DM Smith wrote:
>> I mostly agree. But once I know that the module is NFC, I'd rather not
>> take the hit. I must have made the KJV into a module 100 or more times
>> before I got it right.
> What would you think of making normalization the default and using a
> switch to turn it off? It doesn't particularly matter for me, but I'm
> thinking of a complete newbie trying to make a module. The defaults
> should be as general purpose and common as possible. Then again, since
> we build from source for our releases (not from submitted compiled
> modules), perhaps it doesn't matter either way.

Either way is fine. I, too, would like osis2mod to be simpler for
a complete newbie.
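
To make the trade-off concrete, here is a minimal Python sketch of
"normalize by default, with a switch to turn it off". It is not osis2mod
code: the function name `to_nfc` and its `normalize` parameter are
hypothetical, and it assumes Python 3.8 or later for
`unicodedata.is_normalized`.

```python
import unicodedata


def to_nfc(text, normalize=True):
    """Return text in NFC unless the caller opts out (hypothetical helper)."""
    if not normalize:
        # The "switch": the caller asserts the input is already NFC.
        return text
    # Quick check first, so text that is already NFC is returned unchanged.
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
    print(len(decomposed), len(to_nfc(decomposed)))
```

With the default left on, a newcomer gets NFC output without having to
know about it, while someone rebuilding a known-good module can pass
`normalize=False` and skip the work.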

>>> Second, your comment about needing UTF-8 input makes me think we should
>>> go ahead and add encoding conversion to the importers as well, possibly
>>> with automatic charset detection.
>> I'd like to see OSIS modules also be UTF-8.
>> What mechanism were you thinking of for automatic charset detection? I
>> have a buggy routine to detect whether something is UTF-8, 7-bit ASCII,
>> or other. We could use that (once I fix it).
>> As to automatic charset detection, could we require that every input to
>> osis2mod have:
>> <?xml version="1.0" encoding="UTF-8"?>
>> or
>> <?xml version="1.0" encoding="cp1252"?>
>> and use whatever is the value for the encoding attribute?
> I planned to just use ICU's charset detection. It takes a bunch of text,
> runs some heuristic algorithms on it, and uses that to guess the
> charset. It supports most of the common standard charsets.

I think this would work well for most modules. My concern is that we
would then also need a way to tell it that the input is in a particular
charset in case the heuristic fails. For example, imagine an English
text that has only a few high-order-bit characters, such as a "smart"
single quote used as an apostrophe. That might not provide enough data
for the detection to work, or the instances might fall outside the
window used for detection.
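
A rough Python sketch of the order of precedence I have in mind: an
explicit override first, then the encoding named in the XML declaration,
then a simple UTF-8 / 7-bit ASCII / other check over the whole input. All
function names here are hypothetical, the `cp1252` fallback is only an
assumption for illustration, and this uses no ICU calls; a real version
could put ICU's detector in place of the final fallback.

```python
import re

# Matches the encoding pseudo-attribute of a leading XML declaration.
# Assumes the declaration is ASCII-compatible and not preceded by a BOM.
XML_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data):
    """Return the encoding named in the XML declaration, or None."""
    match = XML_DECL_ENCODING.match(data)
    return match.group(1).decode("ascii") if match else None


def classify(data):
    """Classify bytes as 'ascii', 'utf-8', or 'other'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


def choose_encoding(data, override=None, fallback="cp1252"):
    """Pick an encoding: override, then declaration, then classification."""
    if override:
        return override
    declared = declared_encoding(data)
    if declared:
        return declared
    kind = classify(data)
    return fallback if kind == "other" else kind


if __name__ == "__main__":
    sample = b"It\x92s"  # cp1252 "smart" apostrophe in otherwise ASCII text
    print(classify(sample), choose_encoding(sample))
```

Because `classify` decodes the entire input rather than a sample, a single
stray 0x92 byte is noticed no matter where it occurs, though for a large
module that costs a full pass over the text.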
