[sword-devel] osis2mod change

Chris Little chrislit at crosswire.org
Mon Feb 25 22:03:00 MST 2008



DM Smith wrote:
> 
> I mostly agree. But once I know that the module is NFC, I'd rather not
> take the hit. I must have made the KJV into a module 100 or more times 
> before I got it right.

What would you think of making normalization the default and using a 
switch to turn it off? It doesn't particularly matter to me, but I'm 
thinking of a complete newbie trying to make a module. The defaults 
should suit the most common, general-purpose case. Then again, since 
we build our releases from source (not from submitted compiled 
modules), perhaps it doesn't matter either way.
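
To make the suggestion concrete, here is a minimal sketch of "normalize
by default, opt out with a switch". It is Python rather than osis2mod's
C++, purely for brevity, and the --no-normalize flag and both function
names are hypothetical, not existing osis2mod options.

```python
import argparse
import unicodedata


def build_parser():
    """Command-line parser for a toy importer (illustration only)."""
    parser = argparse.ArgumentParser(description="toy module importer")
    # Hypothetical flag: normalization stays on unless explicitly disabled.
    parser.add_argument(
        "--no-normalize",
        dest="normalize",
        action="store_false",
        help="skip NFC normalization (input is already known to be NFC)",
    )
    return parser


def prepare_text(text, normalize=True):
    """Return text in NFC form unless the caller opted out."""
    return unicodedata.normalize("NFC", text) if normalize else text
```

A newbie who passes no flags gets NFC output; someone rebuilding a
module they already know is NFC can pass the switch and skip the cost.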

>> Second, your comment about needing UTF-8 input makes me think we should
>> go ahead and add encoding conversion to the importers as well, possibly
>> with automatic charset detection.
> 
> I'd like to see OSIS modules also be UTF-8.
> 
> What mechanism were you thinking of for automatic charset detection? I 
> have a buggy routine to detect whether something is UTF-8, 7-bit ASCII 
> or other. We could use that (once I fix it).
> 
> As to automatic charset detection, could we require that every input to 
> osis2mod have:
> <?xml version="1.0" encoding="UTF-8"?>
> or
> <?xml version="1.0" encoding="cp1252"?>
> and use whatever is the value for the encoding attribute?

I planned to just use ICU's charset detection. It takes a sample of 
text, runs heuristics over it, and guesses the charset from the 
results. It supports most of the common standard charsets.
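
For comparison, the two simpler ideas from your message can be sketched
without ICU: read the encoding attribute from the XML declaration, and
otherwise fall back to a three-way UTF-8 / ASCII / other check. This is
only a rough Python illustration with hypothetical function names; it
is not ICU's detector, which applies statistical heuristics across many
charsets, and it is not your existing routine.

```python
import codecs
import re

# Matches the encoding attribute of a leading XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data):
    """Return the encoding named in the XML declaration, or None."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    match = _DECL.match(data)
    return match.group(1).decode("ascii") if match else None


def classify(data):
    """Classify raw bytes as 'ascii', 'utf-8', or 'other'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"
```

The declaration is only as trustworthy as whoever wrote it, so checking
it against the bytes (or against a real detector) would still be worth
doing.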

--Chris