[sword-devel] usfm2osis.pl

Mon Jul 9 07:42:45 MST 2012

Just a comment to the whole thread, regarding XML entities:
Since we validate OSIS against a schema rather than a DTD, only the five predefined XML entities will validate: amp, lt, gt, quot, and apos. These should be used as sparingly as possible.
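
As a rough illustration of using those entities sparingly, here is a minimal Python sketch (not code from usfm2osis.pl or osis2mod). It relies on the standard library's xml.sax.saxutils.escape; the helper names escape_text and escape_attribute are made up for this example. Element text gets only amp, lt, and gt, and the quote entities are added only for attribute values.

```python
from xml.sax.saxutils import escape


def escape_text(text: str) -> str:
    """Escape element content: only &, < and > are replaced."""
    return escape(text)


def escape_attribute(value: str) -> str:
    """Escape an attribute value: additionally replace both quote characters."""
    return escape(value, {'"': "&quot;", "'": "&apos;"})
```

Quotation marks in running text are left alone by escape_text, which keeps the entity count in the output as low as the XML rules permit.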

Character references (&#xYYYY;) are allowed, but cause havoc in non-HTML renderers. osis2mod should convert these to the Unicode characters they reference.
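
To make the conversion concrete, here is a hedged Python sketch of the idea. It is an illustration only, not how osis2mod is implemented, and the function name expand_char_refs is hypothetical. It assumes that references which decode to markup-significant characters must stay escaped, so those are rewritten as one of the five named entities instead of a literal character.

```python
import re

# Decimal (&#187;) or hexadecimal (&#xBB;) numeric character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

# Characters that must remain escaped to keep the XML well-formed.
_KEEP_ESCAPED = {
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
}


def expand_char_refs(xml_text: str) -> str:
    """Replace numeric character references with the characters they name."""

    def _replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            char = chr(code_point)
        except ValueError:
            # Not a valid Unicode code point: leave the reference untouched.
            return match.group(0)
        return _KEEP_ESCAPED.get(char, char)

    return _CHAR_REF.sub(_replace, xml_text)
```

Named entities such as &amp; are not matched by the pattern and pass through unchanged.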

Regarding the chevron encoding, it *may* be reasonable for usfm2osis.pl or osis2mod to warn of problems. Yes, an XML validator should be used before osis2mod, but the errors it produces are sometimes less than helpful.
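
One possible shape for such a warning, sketched in Python and not taken from either tool: scan the source text for runs of angle brackets and report where they occur, so the message points at the SFM line and not at a parse failure further downstream. The function name find_angle_brackets and the tuple layout are assumptions made for this example.

```python
import re

# A run of one or more '<' or one or more '>' characters.
_ANGLE_RUN = re.compile(r"<+|>+")


def find_angle_brackets(sfm_text: str) -> list[tuple[int, int, str]]:
    """Return (line number, column, run) for each angle-bracket run, 1-based."""
    hits = []
    for line_no, line in enumerate(sfm_text.splitlines(), start=1):
        for match in _ANGLE_RUN.finditer(line):
            hits.append((line_no, match.start() + 1, match.group(0)))
    return hits


def format_warnings(sfm_text: str) -> list[str]:
    """Turn each hit into a human-readable warning string."""
    return [
        f"line {line_no}, column {col}: found '{run}'; "
        "possibly a mis-encoded quotation mark"
        for line_no, col, run in find_angle_brackets(sfm_text)
    ]
```

The check only reports; deciding whether a run is a chevron, a curly quote, or a literal angle bracket is left to whoever knows the source text.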

In Him,
	DM

On Jul 9, 2012, at 9:26 AM, Greg Hellings wrote:

> On Mon, Jul 9, 2012 at 8:08 AM, Chris Little <chrislit at crosswire.org> wrote:
>> On 7/9/2012 5:29 AM, Greg Hellings wrote:
>>> 
>>> On Mon, Jul 9, 2012 at 2:13 AM, Chris Little <chrislit at crosswire.org> wrote:
>>>> 
>>>> On 7/8/2012 10:43 PM, Greg Hellings wrote:
>>>>> 
>>>>> 
>>>>> Guys,
>>>>> 
>>>>> Was just running usfm2osis.pl across some files that my aunt and uncle
>>>>> have given me to convert for the language they're working with through
>>>>> Wycliffe. It ran great, saw no problems with it. When I tried to run
>>>>> title_cleanup.pl across the output it revealed a minor issue... the
>>>>> language they have used appears to use the "French style" of quotation
>>>>> mark, but it is marked up in the SFM text as "<<" and ">>". A pair of
>>>>> ASCII angle characters. This causes title_cleanup.pl, which is
>>>>> expecting good XML, to puke on parsing the file. Of course, it would
>>>>> also cause osis2mod to puke when I get to that stage.
>>>>> 
>>>>> Obviously this is an encoding issue in the source file, but I thought
>>>>> I should mention that this is also a bug/shortcoming of usfm2osis.pl.
>>>>> If it is supposed to be outputting well-formed XML then it should
>>>>> encode the plain text to escape such characters with their proper XML
>>>>> entity representations. Is there anyone who wants to look into that,
>>>>> or do I need to roll up my Perl sleeves and get dirty?
>>>> 
>>>> 
>>>> 
>>>> Handling of <</>>-style SFM quotation marks was formerly part of
>>>> usfm2osis.pl, but has been commented out. The angle-brackets are not
>>>> necessarily used to encode French-style chevrons for quotation marks,
>>>> since they were also used in many SFM files to encode curly-quotes, as
>>>> used in English typography.
>>> 
>>> 
>>> Speaking with the translator they encouraged me to replace the
>>> two-character ASCII sequence with the proper chevron characters.
>>> Although when going to print they are replacing the characters with
>>> curly quotes, they like the idea of the chevrons better. The real
>>> issue is that the double angle brackets are not the only time that '>'
>>> and '<' appear in the text. When outputting the text to an XML output
>>> format, should these characters not be encoded to XML entities?
>>> Dealing with scripts developed for non-Latin languages we can't just
>>> assume these characters won't appear in some arcane form.
>>> 
>>>> 
>>>> I don't think I've ever seen angle-brackets in a USFM file that were
>>>> supposed to be present. The example you cite is SFM, which we obviously
>>>> can't reliably support. The fact that we do not handle angle-brackets
>>>> helps to identify encoding errors in the text. The alternative would be
>>>> to convert them to XML escapes and pass the mis-encoded characters on to
>>>> the OSIS document, where they would probably go unnoticed.
>>>> 
>>>> So, all things considered, I think it's a good thing that the output of
>>>> usfm2osis.pl caused later utilities to choke, thereby signaling that you
>>>> need to correct the character encoding problem in the source.
>>> 
>>> 
>>> What if this were a case where the angle brackets were properly encoded as
>>> such?
>> 
>> 
>> If < or > should really appear in the text, then certainly they should be
>> escaped. I've honestly just never seen that in USFM files. I have often seen
>> angle brackets used in old SFM files, where they always (or at least almost
>> always) represent some kind of quotation mark--generally chevrons.
>> 
>> My position is chiefly that we shouldn't handle angle brackets in
>> usfm2osis.pl because they are usually encoding errors, from the perspective
>> of USFM. Not handling them in the script lets us identify and correct these
>> encoding errors, which would probably go unnoticed otherwise. (The original
>> version of the sfm to osis converter did handle them because it was tailored
>> to a particular set of sfm files and predated the USFM spec online.)
>> 
>> This is basically a compromise position. If we could guarantee that the
>> input to usfm2osis.pl was always USFM, we could just escape all
>> angle-brackets. Since we would like to accept most SFM content and do
>> something semi-reasonable with it, we should probably not handle
>> angle-brackets because it's difficult to predict what the encoder intended.
> 
> In my case it appears that all instances of single angle brackets
> appear inside of double angle brackets, so I'm going on the assumption
> I'm dealing with nested quotes and replacing them with single chevrons
> for now just as the translator said to do with double angle brackets
> and double chevrons. If leaving angle brackets unencoded is
> intentional, that's fine. I just didn't want it to be an oversight
> when encoding to XML.
> 
> --Greg
> 
>> 
>> 
>> --Chris
>> 
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page