[sword-devel] USFM conformance in usfm2osis.py
chrislit at crosswire.org
Wed Aug 1 23:54:53 MST 2012
On 08/01/2012 01:11 AM, David Haslam wrote:
> Three poorly defined uses of USFM verse tags that we often encounter are as follows.
> Spanned verses (translators use a verse range), thus
> \v 7-11 Text for these five verses
> or (worse still, a naughty use of the comma delimiter)
> \v 3,4 Text for two consecutive verses
> Split verses (translators using composite verse tags with parts a and b of
> the text), e.g.
> \v 19a Text for the first part of verse nineteen
> \v 19b Text for the second part of verse nineteen
Based on my experience encoding USX, I believe these are all invalid.
They're definitely invalid from the perspective of USX, where the verse
values must be sequential numerals. But we need to treat them all as
valid, because a reasonable reading of the USFM documentation gives no
indication that these should be invalid.
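To make the three cases concrete, here is a rough sketch of how such verse identifiers might be normalised before conversion. It is not taken from usfm2osis.py; the helper name parse_verse_id and its return shape (a list of (verse number, part letter) tuples) are assumptions made purely for illustration.

```python
import re

# Hypothetical helper, not part of usfm2osis.py.
# Accepts "7", "19a", "7-11" and "3,4" style identifiers.
_VERSE_ID = re.compile(r'^(\d+)([a-z]?)(?:\s*([-,])\s*(\d+)([a-z]?))?$')


def parse_verse_id(raw):
    """Expand the identifier of a \\v tag into (verse_number, part) tuples.

    part is '' for a whole verse, or a letter such as 'a' or 'b'.
    """
    m = _VERSE_ID.match(raw.strip())
    if m is None:
        raise ValueError('unrecognised verse identifier: {0!r}'.format(raw))
    start, start_part, sep, end, end_part = m.groups()
    start = int(start)
    if sep is None:
        # Plain verse or split verse, e.g. "7" or "19a".
        return [(start, start_part)]
    end = int(end)
    if sep == ',':
        # Comma form: treat as an explicit list of two verses.
        return [(start, start_part), (end, end_part)]
    # Hyphen form: expand the whole range.
    if end < start:
        raise ValueError('descending verse range: {0!r}'.format(raw))
    verses = [(n, '') for n in range(start, end + 1)]
    verses[0] = (start, start_part)
    verses[-1] = (end, end_part)
    return verses


print(parse_verse_id('7-11'))
print(parse_verse_id('3,4'))
print(parse_verse_id('19a'))
```

Treating the comma form as a two-item list rather than a range is a judgment call; a stricter converter might warn on it instead.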
> btw. Peter and I have collected a substantial body of real world USFM suites
> which you could probably use for testing your conversion script.
That's likely to be quite helpful soon, when I get to the point of
writing regression tests. I've currently got the Open Bible Translation,
WEB, and RV happily running through the script and generating valid
OSIS. There are still 19 tags that usfm2osis.pl handles which I haven't
addressed in usfm2osis.py. So I'd like to add handling for all of those
tags, so that the new utility is at least as capable as the old. Then I
may run it on Michael's collection of documents and complete coverage of
its set of tags.
Then, I'll begin collecting markup samples, verifying that the script
generates valid & reasonable output against those samples, and writing
tests against the generated output so that we can be certain that future
changes to the script do not break it.
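As an illustration of the kind of sample-driven regression test I have in mind, here is a minimal sketch using unittest. The convert_stub function is only a placeholder standing in for the real converter (it handles nothing beyond plain "\v N text" lines), and the sample text and the default Gen/chapter 1 reference are made up for the example.

```python
import re
import unittest
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def convert_stub(usfm, book='Gen', chapter=1):
    """Placeholder converter (hypothetical): NOT the real usfm2osis.py logic."""
    verses = []
    for m in re.finditer(r'^\\v (\d+) (.*)$', usfm, re.MULTILINE):
        verses.append('<verse osisID="{0}.{1}.{2}">{3}</verse>'.format(
            book, chapter, m.group(1), escape(m.group(2))))
    return '<div>' + ''.join(verses) + '</div>'


# Each sample pairs a USFM snippet with the osisIDs expected in the output.
SAMPLES = [
    ('\\v 1 First sample verse\n\\v 2 Second sample verse',
     ['Gen.1.1', 'Gen.1.2']),
]


class ConversionRegressionTest(unittest.TestCase):
    def test_samples(self):
        for usfm, expected_ids in SAMPLES:
            # Parsing fails loudly if the output is not well-formed XML.
            root = ET.fromstring(convert_stub(usfm))
            ids = [v.get('osisID') for v in root.iter('verse')]
            self.assertEqual(ids, expected_ids)
```

Saved to a file, this can be run with "python -m unittest". In a real suite the stub would be replaced by a call into the converter, and the output would also be validated against the OSIS schema rather than only checked for well-formedness.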
Somewhere down the line I may incorporate xreffix.pl-type functionality
as an option for those who have installed Sword bindings.
I'll definitely save the example USFM markup you included here and would
welcome additional examples of potentially problematic markup.