[sword-devel] USFM -> OSIS -> Sword

Tue Mar 6 00:59:45 MST 2012

On 06/03/12 01:42, Kahunapule Michael Johnson wrote:

> It is most likely a misunderstanding. Perhaps I have also been
> misunderstanding some of the messages that seem to be opposed to USFM.
> I'm not trying to suggest that USFM be made an additional internal
> format for Sword for Bible search and display, like GBF and OSIS.

OK, good that this has been cleared up somewhat.

Let me answer a few points.

> I'm trying to convert Scripture files on a scale and with speed that
> is apparently unprecedented.

The scale might be unprecedented, but there are reasons for that:

1) None of us have yet encountered a USFM text which is actually clean.
It does not exist. Both UBS and SIL have USFM gurus who rove about and
help people fix things, but even the cleanest texts I have encountered
did not conform 100% (i.e. not 99.999% but 100%) to the USFM standard.
Many times this is a minor irritation, but far too often it is actually
crucial. XML validation tends to throw up things like that. The stuff
which really concerns me is missing chapter numbers, missing spaces
after tags, letters replacing numbers (1/l/I, 0/O), verse ranges not
properly encoded, etc. The older the translation (i.e. pre-Unicode),
and the "cleverer" the translators needed to be to make things work,
the more of that kind of stuff I find. I know that.

2) The next thing is missing tags in our routines. You describe
usfm2osis.pl as supporting a small subset of USFM tags. I am not sure
where you are looking, given that the svn version is the only one I ever
use and svn usfm2osis.pl rarely stumbles nowadays. I will admit that we
never set out to create an all-encompassing routine; rather, we set up a
structural skeleton and add tag by tag as we encounter them. Right now,
I think all commonly used tags are in it, and every missing tag has its
pre-ordained place. But the more we move into the realm of existent but
infrequently used tags, the more I find that they were used by people
who had no clue of the USFM specification. They used these tags either
for their graphical presentation or they simply misread them. So I have
found the q series of poetic tags being used for quotations and vice
versa. The roving USFM gurus make this kind of thing less common, but
it still exists, particularly in older translations. So what we often
need to do is go back to the translation team and ask them: what did
you mean with this tag? We might need to correct the USFM and,
obviously, we might add something to usfm2osis.pl from time to time.

3) Missing \id markers are surprisingly common even in very good USFM texts.

4) Plain wrong use of markers, e.g. \d and \s in Psalms. In print it
might make no difference, but if you can switch headlines on and off
then it is relevant to figure out which ones are canonical and which
ones are an aid for the modern reader. And if all are encoded as \s and
you do not speak the language, then you have a task in front of you.

5) Intros and canonical text - people do not always use the tags
provided for introductions. This somewhat diminishes the returns on
being able to make intros appear different. Again, I usually highlight
this back to the translators and fix it.

etc
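
Some of the problems under point 1 can be caught mechanically. The
following is only a minimal sketch in Python, not part of usfm2osis.pl;
the function name and the rule that a chapter or verse number must be
plain digits or a simple "3-5" range are my assumptions for
illustration. It flags \c and \v markers glued to their number and
numbers containing letters such as l, I or O.

```python
import re

# \c or \v followed by whitespace and the token that should be a number.
NUMBER_MARKER = re.compile(r"\\(c|v)\s+(\S+)")
# \c or \v glued directly to a digit, e.g. "\v12" (missing space).
GLUED_MARKER = re.compile(r"\\(c|v)(?=\d)")
# Assumption: a valid number is plain digits or a simple range like "3-5".
VALID_NUMBER = re.compile(r"[0-9]+(-[0-9]+)?$")


def lint_usfm_numbers(text):
    """Return a list of (line_number, problem) tuples for one USFM file."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in GLUED_MARKER.finditer(line):
            problems.append((lineno, "missing space after \\" + match.group(1)))
        for match in NUMBER_MARKER.finditer(line):
            marker, value = match.groups()
            if not VALID_NUMBER.match(value):
                problems.append(
                    (lineno, "suspicious number %r after \\%s" % (value, marker))
                )
    return problems
```

This only reports; deciding whether "1O" was meant to be 10 is still a
question for a human.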

As a consequence, I have found that instead of having a straight process

USFM -> OSIS -> module

I go round the circle several times, fixing the USFM, writing to the
translators, getting them to accept my changes to the encoding, pointing
out missing verses, clarifying verse ranges, etc etc.

This process is now highly automated and I can largely do the manual
bits in my sleep, but there remains a manual element to it in each and
every translation.

It is possible to cut corners - for sure. But I do not cut any, as I see
my work in part as a service to the translators too - I want to honour
their effort and I want to help them get the best possible result in
the long run. Fixing USFM encoding will improve the maintainability of
their text and help make other things easier too - e.g. USFM -> XeTeX
typesetting -> paper. Another reason is that I am dealing with
translations which are no longer in flux, but are finalised and in
print. So I am not working with a moving target.

Could you automatise it completely?

I think yes, if you make some deliberate choices.

Missing tags are a minor nuisance - it should be easy to obtain a
complete list of missing tags for all 300 translations by running all
the files through usfm2osis.pl in bulk, filtering out what is missing
and adding it. Next time round there won't be any missing tags. It will
be maybe 10-20 tags, and most of these will be simple to implement.
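
One way to get such a list without touching usfm2osis.pl at all is to
inventory the markers in the source files and compare them with the set
of tags the converter handles. A rough sketch in Python follows; the
supported set passed in is a placeholder you would have to fill from
usfm2osis.pl yourself, and the function names are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# A marker is a backslash, an optional "+" (nested character style),
# lowercase letters and optional digits; a trailing "*" closes it.
MARKER = re.compile(r"\\\+?([a-z]+[0-9]*)\*?")


def count_markers(text):
    """Count every marker name in one USFM text (closers count too)."""
    return Counter(MARKER.findall(text))


def unsupported_markers(paths, supported):
    """Return {marker: occurrences} for markers not in `supported`."""
    totals = Counter()
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        totals.update(count_markers(text))
    return {name: n for name, n in totals.items() if name not in supported}
```

Sorting the result by count shows which of the missing tags are worth
implementing first.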

Misused tags are a problem. You need a filtering process to find the
commonly misused ones and either highlight these back to the translators
and drop the translation until you get a cleaned up text or indeed do
something by hand. Up to you.

Structural problems - letters for numbers, missing verses, missing
chapter numbers, poorly encoded ranges - instead of fixing, I would
suggest devising filters and dropping these/referring back to translators.
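
As an illustration of what such a filter might look like, here is a
minimal sketch in Python. It is an assumption of mine, not an existing
tool: it checks for a leading \id, for verses appearing before any \c,
and for gaps or disorder in the verse sequence, treating "3-5" as a
well-formed range. It only reports problems; whether a book is dropped
or referred back is a decision for whoever runs it.

```python
import re

TOKEN = re.compile(r"\\(id|c|v)\s+(\S+)")
VERSE = re.compile(r"([0-9]+)(?:-([0-9]+))?")


def structural_problems(text):
    """Return a list of human-readable problems found in one USFM book."""
    problems = []
    text = text.lstrip("\ufeff")  # ignore a byte order mark, if any
    if not re.match(r"\s*\\id\s+\S+", text):
        problems.append("missing \\id marker at start of file")
    chapter = None
    expected = 1
    warned_no_chapter = False
    for marker, value in TOKEN.findall(text):
        if marker == "id":
            continue
        if marker == "c":
            if not re.fullmatch(r"[0-9]+", value):
                problems.append(f"bad chapter number {value!r}")
            chapter = value
            expected = 1
            continue
        # From here on the marker is \v.
        if chapter is None and not warned_no_chapter:
            problems.append("verse found before any \\c marker")
            warned_no_chapter = True
        match = VERSE.fullmatch(value)
        if match is None:
            problems.append(f"bad verse number {value!r} in chapter {chapter}")
            continue
        first = int(match.group(1))
        last = int(match.group(2) or first)
        if last < first:
            problems.append(f"reversed range {value!r} in chapter {chapter}")
            last = first
        if first > expected:
            problems.append(
                f"chapter {chapter}: verses {expected}-{first - 1} "
                f"missing before {value}"
            )
        elif first < expected:
            problems.append(f"chapter {chapter}: verse {value} out of order")
        expected = last + 1
    return problems
```

A book that returns an empty list passes the filter; anything else goes
on the list for the translators.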

Quotation marks, words of Jesus (WoJ) and presentational markup

I guess one can aim too high and then fall down. If you do not encode
these in OSIS, then all marks will remain just as the translator
intended. Which is clearly the best.

I have never even attempted to do OSIS encoding of speech. The reason is
simple - most texts have unequal opening and closing marks of speech. So
you might know something starts, but never where it closes. David
conscientiously produces a report telling the translators "You have used
12347 opening and 12321 closing quotation marks!" No one has ever sent
us a fixed text after that. So, I have given up on encoding this kind of
stuff - unless there is specific USFM tagging like \wj.
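
For what it is worth, a count of that kind takes only a few lines. This
is a sketch in Python, not David's actual report; the list of quotation
mark pairs is an assumption and would need adjusting per language.

```python
from collections import Counter

# Assumed pairs: curly double quotes and guillemets. Curly single quotes
# are left out because the closing one doubles as an apostrophe.
QUOTE_PAIRS = [("\u201c", "\u201d"), ("\u00ab", "\u00bb")]


def quote_balance(text):
    """Return {pair: (opening_count, closing_count)} for each quote pair."""
    counts = Counter(text)
    return {
        opening + closing: (counts[opening], counts[closing])
        for opening, closing in QUOTE_PAIRS
    }
```

Unequal counts tell you that something is wrong, but not where, which is
exactly why I do not try to turn these marks into OSIS speech markup.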

Similarly presentational markup - I usually try and extract out of the
translators what they meant by this and then suggest better USFM tags,
but if you do not want to do this - no one is going to crucify you if you
produce presentational OSIS in such cases.

Finally - we do try and maintain a very high standard in all modules we
put into our own repo. But if you work with moving targets -
translations which are still in flux and get constantly updated, I think
it is more than fair enough to simply run through with a process and put
stuff into a public repo. VERY FEW of the flaws I explained above will
actually totally break a Sword module. Teus Benschop of Bibledit has his
own fairly brute force conversion routine in Bibledit. None of the
modules he produces would pass muster for our repo. But they do just
fine in the circumstances he requires them - for people to see the
progress of the translation, for using in Xiphos which he ties against
Bibledit during translation etc etc.

I hope this clarifies things. Sorry for the long epistle.

I summarise - if you want to go all mechanical, do the following:

Identify all missing tags in bulk and we/you can add them to
usfm2osis.pl. Small job. Maybe also create a "toned down" version of
usfm2osis.pl for you which reduces the risk of dual use USFM tags by
deliberately marking stuff as presentational only.

Devise some filter to drop what needs dropping - letters in verse and
chapter numbers are the biggest no-no. Missing \id and \c markers are
the other one. Create a small add-on to usfm2osis.pl which will fix the
single chapter books, where commonly no chapter marker is used. The rest
will not break a module.

Create a repo and let people use it.

Lean back and relax in your tropical sunshine
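
The small add-on for single chapter books mentioned above could be a
pre-processing step along these lines. This is a sketch in Python rather
than a patch to usfm2osis.pl; the list of book codes covers only the
Protestant canon and is an assumption, as is the choice to place the
marker above a bare paragraph marker that directly precedes the first
verse.

```python
import re

# Single chapter books by USFM book code (assumed list; extend as needed).
SINGLE_CHAPTER_BOOKS = {"OBA", "PHM", "2JN", "3JN", "JUD"}
# A line holding nothing but a paragraph or poetry marker.
BARE_PARAGRAPH = re.compile(r"\\(p|m|q[0-9]?)\s*")


def add_missing_chapter(text):
    """Insert "\\c 1" into a single chapter book that lacks any \\c."""
    id_match = re.search(r"\\id\s+(\S+)", text)
    if id_match is None:
        return text
    if id_match.group(1).upper() not in SINGLE_CHAPTER_BOOKS:
        return text
    if re.search(r"\\c\s+\S", text):
        return text  # a chapter marker is already present
    lines = text.split("\n")
    for index, line in enumerate(lines):
        if re.search(r"\\v\s", line):
            # Keep a bare paragraph marker together with its first verse.
            if index > 0 and BARE_PARAGRAPH.fullmatch(lines[index - 1]):
                index -= 1
            lines.insert(index, "\\c 1")
            break
    return "\n".join(lines)
```

Books outside the list, and books which already carry a \c marker, pass
through unchanged.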

---

Yours

Peter
