[sword-devel] usfm2osis.pl

Chris Little chrislit at crosswire.org
Mon Jul 9 00:13:13 MST 2012

On 7/8/2012 10:43 PM, Greg Hellings wrote:
> Guys,
> Was just running usfm2osis.pl across some files that my Aunt and Uncle
> have given me to convert for the language they're working with through
> Wycliffe. It ran great, saw no problems with it. When I tried to run
> title_cleanup.pl across the output it revealed a minor issue... the
> language they have used appears to use the "French style" of quotation
> mark, but it is marked up in the SFM text as "<<" and ">>". A pair of
> ASCII angle characters. This causes title_cleanup.pl, which is
> expecting good XML, to puke on parsing the file. Of course, it would
> also cause osis2mod to puke when I get to that stage.
> Obviously this is an encoding issue in the source file, but I thought
> I should mention that this is also a bug/shortcoming of usfm2osis.pl.
> If it is supposed to be outputting well-formed XML then it should
> encode the plain text to escape such characters with their proper XML
> entity representations. Is there anyone who wants to look into that,
> or do I need to roll up my Perl sleeves and get dirty?

Handling of <</>>-style SFM quotation marks was formerly part of
usfm2osis.pl, but has been commented out. Angle brackets do not
necessarily encode French-style chevrons (guillemets): many SFM files
also used them to encode curly quotes, as used in English typography.
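
As a rough illustration of that ambiguity (a sketch only, not the code that
was commented out of usfm2osis.pl; the table, the function name and the
"style" parameter are all hypothetical), the same ASCII input needs a
different mapping depending on which convention the source file followed:

```python
# Hypothetical mapping tables: the same ASCII digraphs stand for different
# characters depending on the convention the SFM author had in mind.
QUOTE_STYLES = {
    "guillemet": {"<<": "\u00ab", ">>": "\u00bb", "<": "\u2039", ">": "\u203a"},
    "curly": {"<<": "\u201c", ">>": "\u201d", "<": "\u2018", ">": "\u2019"},
}


def replace_sfm_quotes(text, style):
    """Replace ASCII angle-bracket quotes using the chosen convention."""
    table = QUOTE_STYLES[style]
    # Doubled brackets are replaced before single ones so that "<<" is not
    # consumed as two single quotes. A run such as ">>>" is itself
    # ambiguous, and this sketch makes no attempt to resolve it.
    for ascii_form in ("<<", ">>", "<", ">"):
        text = text.replace(ascii_form, table[ascii_form])
    return text
```

Nothing in the text itself says which style is correct, so the choice has
to come from someone who knows the source.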

I don't think I've ever seen angle brackets in a USFM file that were
supposed to be there. The example you cite is SFM rather than USFM, and
SFM is something we obviously can't reliably support. The fact that we do
not handle angle brackets helps to identify encoding errors in the text.
The alternative would be to convert them to XML escapes and pass the
mis-encoded characters on to the OSIS document, where they would probably
go unnoticed.

So, all things considered, I think it's a good thing that the output of 
usfm2osis.pl caused later utilities to choke, thereby signaling that you 
need to correct the character encoding problem in the source.
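
For locating the offending characters in the source before conversion, a
small check along these lines might help. It is a minimal sketch, not part
of usfm2osis.pl or any SWORD utility; the function name is hypothetical and
it assumes the source has already been read into a list of lines:

```python
import re

# One or more consecutive '<' characters, or one or more consecutive '>'.
ANGLE_RUN = re.compile(r"<+|>+")


def find_angle_brackets(lines):
    """Return (line_number, column, run) for each run of '<' or '>'.

    Line numbers and columns are 1-based.
    """
    hits = []
    for line_number, line in enumerate(lines, start=1):
        for match in ANGLE_RUN.finditer(line):
            hits.append((line_number, match.start() + 1, match.group()))
    return hits
```

Each reported position is a place where the source probably needs a real
quotation mark character in place of the ASCII stand-in.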

