[sword-devel] USFM conformance in usfm2osis.py

Wed Aug 1 01:11:19 MST 2012

Hi Chris,

There are some empirical aspects of USFM that are not specified in the *USFM
User Reference*.
They seem to be defined /de facto/ by UBS Paratext.

Note that the USFM standard specifies the syntax for verse tags as:
*\v_#*
(where the underscore stands for a space) and does not have an underscore
after the #.

In practice, all software that reads USFM files requires some white space
there, but that white space can be a space, a tab, or one or both of the
line-end characters (carriage return, line feed).

A poorly documented fact about USFM is that within a text field, line end is
equivalent to a space, and multiple spaces (that aren't part of the markup)
are equivalent to one space.
Thus the following are equivalent:

\v 1
In
the
beginning,     God ...

and

\v 1 In the beginning, God ...
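
A minimal sketch of that equivalence in Python, assuming it is acceptable to collapse white space across the whole string (a real reader would need to protect any white space that is part of the markup). The function name normalize_text is hypothetical and not taken from usfm2osis.py.

```python
import re

def normalize_text(text: str) -> str:
    """Treat line ends as spaces and collapse runs of white space to one space."""
    # Space, tab, carriage return and line feed are all treated alike.
    return re.sub(r"[ \t\r\n]+", " ", text).strip()

# The two forms of the example, written here with a space after \v.
wrapped = "\\v 1\nIn\nthe\nbeginning,     God ..."
single_line = "\\v 1 In the beginning, God ..."

print(normalize_text(wrapped) == normalize_text(single_line))
```

Comparing the normalized strings, rather than the raw input, is one way for a converter to stay indifferent to how an editor happened to wrap the lines.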

Three poorly defined uses of USFM verse tags that we often encounter are as
follows:

Spanned verses (translators use a verse range), thus

\v 7-11 Text for these five verses

or (worse still, a naughty use of the comma delimiter)

\v 3,4 Text for two consecutive verses

Split verses (translators using composite verse tags with parts a and b of
the text), e.g.

\v 19a Text for the first part of verse nineteen
\v 19b Text for the second part of verse nineteen

and when this is done, there are sometimes extra USFM tags between parts a
and b, e.g.

\v 19a Tarus, dia makang la jadi kuat kombali.
\s1 Saulus kasi tau Kabar Bae soal Yesus par orang-orang di Damsik
\sr 9:19b-25
\p
\v 19b Saulus tinggal deng orang-orang yang iko Yesus di kota Damsik kurang
labe dua tiga hari bagitu.

That's a real-world example that I just encountered.
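
The spanned and split forms above could be recognised with a tolerant parser along these lines. This is only a sketch: the names VerseRef and parse_verse_id are hypothetical, and treating "3,4" as the range 3 to 4 is an assumption about how a converter might choose to handle the comma form.

```python
import re
from typing import NamedTuple, Optional

class VerseRef(NamedTuple):
    start: int            # first verse covered
    end: int              # last verse covered (same as start for a single verse)
    part: Optional[str]   # "a", "b", ... for a split verse, otherwise None

# A number, an optional part letter, then optionally "-" or "," and a second number.
VERSE_ID = re.compile(r"(\d+)([a-z])?(?:[-,](\d+))?")

def parse_verse_id(token: str) -> VerseRef:
    """Parse the identifier that follows \\v, e.g. '7-11', '3,4' or '19a'."""
    m = VERSE_ID.fullmatch(token)
    if m is None:
        raise ValueError(f"unrecognised verse identifier: {token!r}")
    start = int(m.group(1))
    end = int(m.group(3)) if m.group(3) else start
    return VerseRef(start, end, m.group(2))

for token in ("7-11", "3,4", "19a", "19b"):
    print(token, parse_verse_id(token))
```

Keeping the part letter separate lets the caller decide whether 19a and 19b should be merged into one output verse, even when other tags such as \s1 or \p fall between them.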

Michael J. and I discussed these things recently (May 4). Further, he writes:

  Unlike OSIS, USFM leaves little wiggle room for interpretation, and when
  it does, the master reference implementation, Paratext, rules, for
  pragmatic reasons.

  In general, I try to read USFM with reasonable tolerance, and write it
  with reasonable consistency.
  My USFM reader, for example, always accepts \mt as being equivalent to
  \mt1, no matter if \mt2 is present or not.
  When writing USFM, though, it is better practice to always include the
  "1".

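The "read tolerantly, write consistently" approach for \mt could look roughly like the following sketch. The function name canonical_marker and the contents of NUMBERED_MARKERS are illustrative assumptions; only the \mt case comes from the message quoted above, and the set is not meant to be complete.

```python
# Markers assumed, for illustration, to take a level number (e.g. \mt1, \mt2).
NUMBERED_MARKERS = {"mt", "imt", "s", "q"}

def canonical_marker(marker: str) -> str:
    """Return the marker with an explicit level '1' when none is given."""
    name = marker.lstrip("\\")
    if name in NUMBERED_MARKERS:
        name += "1"          # \mt is read as \mt1
    return "\\" + name

for marker in ("\\mt", "\\mt1", "\\mt2", "\\p"):
    print(marker, "->", canonical_marker(marker))
```

Normalising on input means the rest of a converter only ever sees the explicit form, whichever form the source file used.
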
The examples of *\v_#* in the USFM reference manual all include a space
after the numeric verse number.

Empty verses (with no verse text at all):

Paratext generates them with no trailing space when creating an empty verse
template. Some software that reads USFM expects to find a trailing space,
because the examples in the USFM user reference all contain real text. This
is therefore something else for your Python script to be flexible about.
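
One way to be flexible here is to make the white space after the verse number optional when matching. The sketch below assumes verse identifiers start with a digit; VERSE_TAG and split_verses are hypothetical names, not part of usfm2osis.py.

```python
import re

# \v, some white space (space, tab, CR or LF), the verse identifier,
# then any trailing white space, which may be absent for an empty verse.
VERSE_TAG = re.compile(r"\\v\s+([0-9][^\s\\]*)\s*")

def split_verses(text: str):
    """Return (verse_id, verse_text) pairs; verse_text is '' for empty verses."""
    matches = list(VERSE_TAG.finditer(text))
    verses = []
    for i, m in enumerate(matches):
        # A verse runs up to the next \v tag, or to the end of the input.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        verses.append((m.group(1), text[m.end():end].strip()))
    return verses

template = "\\v 1\n\\v 2\n\\v 3"   # empty verses, no trailing space
print(split_verses(template))
print(split_verses("\\v 1 In the beginning\n\\v\t2\r\n\\v 3 text"))
```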

Last week, I came across an unexpected use of the \imt tag on the same line
as the \mt tag text.

\mt Paulus pung Surat Kadua par Jamaat di Tesalonika\imt Kata-Kata Partama

If the examples in the user reference were definitive, there would be a line
break before the \imt, yet apparently Paratext did not complain when there
wasn't.
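
A reader that splits on markers rather than on line breaks would cope with the \mt ... \imt line above. The following is a rough sketch under that assumption; MARKER and tokenize are hypothetical names, and the pattern deliberately ignores end markers such as those closing character styles with "*".

```python
import re

# A marker is a backslash, letters, and an optional level number.
MARKER = re.compile(r"\\([A-Za-z]+[0-9]*)")

def tokenize(text: str):
    """Return (marker, content) pairs, regardless of where line breaks fall."""
    matches = list(MARKER.finditer(text))
    tokens = []
    for i, m in enumerate(matches):
        # Content runs up to the next marker, or to the end of the input.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        content = " ".join(text[m.end():end].split())  # collapse white space
        tokens.append((m.group(1), content))
    return tokens

line = ("\\mt Paulus pung Surat Kadua par Jamaat di Tesalonika"
        "\\imt Kata-Kata Partama")
for marker, content in tokenize(line):
    print(marker, "|", content)
```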

This sort of thing does seem to crop up quite often, and I've no idea
whether all of them would be detected by the USFM <==> USX processes that
are now done 'under the hood' by Paratext.

BTW, Peter and I have collected a substantial body of real-world USFM suites,
which you could probably use for testing your conversion script.

Best regards,
David
