[sword-devel] USFM -> OSIS -> Sword

Tue Mar 6 15:21:39 MST 2012

On 03/05/2012 09:59 PM, Peter von Kaehne wrote:
> ...
>> I'm trying to convert Scripture files on a scale and with speed that
>> is apparently unprecedented.
>
> The scale might be unprecedented, but there are reasons for that
>
> 1) None of us have yet encountered a USFM text which is actually clean.
> It does not exist. ...

You just didn't look long enough. :-)

Seriously, my experience is similar to yours, and even includes novel USFM errors you haven't listed yet, BUT I'm talking about texts that have already been cleaned up. There really is such a thing as clean USFM, clean enough to pass validation by Paratext, Bibledit, and Haiola. (Yes, I have legitimate access to Paratext.) Therefore, I have hundreds of actually clean USFM texts. Any truly USFM-compliant reader should be able to handle all but one or two of the USFM Scriptures in my current collection. (The others need some input from translators to fix.)

One challenge I face in my input texts is incompleteness. That is a direct consequence of working with texts that are under construction. Sometimes the incomplete state of a text will last indefinitely, because nobody is working on completing it. The only choices here are to either not use the module, strip out incomplete books, or deal appropriately with partial books. USFM and OSIS can both handle partial books, but I'm wondering what effect these might have on the full process, including front ends.

Yes, getting texts into clean Unicode USFM can be challenging, but for contemporary Bible translation projects done with Paratext, Bibledit, Adapt It, or SIL Translation Editor and run through certain publication experts, I'm starting to get very clean files. What isn't clean, I clean up.

> 2) The next thing are missing tags in our routines - you describe
> usfm2osis.pl as a small subset of USFM tags. ...

I'll take another look at the current svn version. Just a cursory glance at the source tells me that it is still missing necessary tags for the set of data I need to process.

> ... Fixing USFM encoding will improve the maintainability of
> their text and help making other things easier too - e.g. USFM -> XETEX
> typesetting -> paper.

Yes, indeed. I feed back my cleaned up USFM to the appropriate Bible translation agencies for their archives for that reason.

>  Another reason is that I am dealing with
> translations which are not anymore in flux, but are finalised and in
> print. So, I am not working with a moving target.

Lucky you. :-)
I have a mixture of finalized and in-progress translations.

> Quotation marks. WoJ and presentational markup
>
> I guess one can aim too high and then fall down. If you do not encode
> these in OSIS, then all marks will remain just as the translator
> intended. Which is clearly the best.

I agree.

> David
> conscientiously produces a report telling the translators "You have used
> 12347 opening and 12321 closing quotation marks!" No one has ever sent
> us a fixed text after that. ...

That may be because the mismatch is correct. Not all languages are like English, but in English, it is normal to have more opening quote marks than closing quote marks. The rule is that when a quotation crosses paragraph boundaries, you don't put a closing quote mark there, but you do put another opening quote mark at the beginning of the next paragraph. Another confusing factor is the fact that apostrophe, closing single quote, straight opening single quote, and sometimes glottal stop might use the same symbol, even in nice clean Unicode texts. I stopped trying to do smart single curly quotes because of the glottal stop ambiguity in several languages. (I even do that myself in Hawai'i, as do people named O'Reilly. I can disambiguate English or Hawaiian programmatically, but it is too hard in languages I don't know.) BUT, the punctuation is correct in the text, so it is best to leave it there as is.
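
To illustrate why a raw count mismatch is not necessarily an error, here is a minimal Python sketch (not any tool mentioned in this thread) that counts double quotation marks while recognising the English continuation-paragraph convention. The function name and the assumption that each paragraph arrives as one string are hypothetical, and it only handles single-character curly double quotes.

```python
def quote_balance(paragraphs, open_q="\u201c", close_q="\u201d"):
    """Return (openers, closers, continuations) for a list of paragraph strings.

    A "continuation" is an opening mark at the start of a paragraph while a
    quotation from an earlier paragraph is still open. Under the English
    convention it has no matching closing mark of its own.
    """
    depth = 0
    openers = closers = continuations = 0
    for para in paragraphs:
        text = para.lstrip()
        start = 0
        if depth > 0 and text.startswith(open_q):
            # Repeated opening mark: count it, but do not nest any deeper.
            continuations += 1
            openers += 1
            start = len(open_q)
        for ch in text[start:]:
            if ch == open_q:
                openers += 1
                depth += 1
            elif ch == close_q:
                closers += 1
                depth = max(depth - 1, 0)
    return openers, closers, continuations


paras = ["He said, \u201cGo now.", "\u201cAnd do not return.\u201d"]
opened, closed, continued = quote_balance(paras)
# The text is consistent when the surplus of openers equals the continuations.
print(opened, closed, continued, opened - continued == closed)
```

A report built this way would flag only the mismatches that the paragraph convention does not explain. It would still be wrong for languages with other quotation rules.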

>  So, I have given up on encoding this kind of
> stuff - unless there is specific USFM tagging like \wj.

Yes, and even then, beware the ambiguity of <q who="Jesus"> without a marker="" attribute, lest you end up adding in extra quote marks.
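
As a rough illustration of that point, the following Python sketch turns \wj ... \wj* spans into <q who="Jesus" marker=""> elements, so that a front end has no reason to generate quote marks of its own. It is only a sketch under simple assumptions: the function name is made up, \wj spans are assumed not to nest or cross verse boundaries, and XML escaping and all other markers are left to the rest of the converter.

```python
import re

# Non-greedy match from an opening \wj to the nearest closing \wj*.
WJ_SPAN = re.compile(r"\\wj\s(.*?)\\wj\*", re.DOTALL)


def wj_to_osis(usfm_text):
    """Wrap words of Jesus in <q who="Jesus" marker="">...</q>.

    The empty marker attribute states explicitly that no quotation mark
    should be generated; the translator's own punctuation stays in the text.
    """
    def replace(match):
        return '<q who="Jesus" marker="">' + match.group(1) + "</q>"

    return WJ_SPAN.sub(replace, usfm_text)


sample = "\\v 3 \\wj \u201cFollow me.\u201d\\wj* he said"
print(wj_to_osis(sample))
```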

> Similarly presentational markup - I usually try and extract out of the
> translators what they meant by this and then suggest better USFM tags,
> but if you do not want to do this - no one is going to crucify you if you
> produce presentational OSIS in such cases.

I also try... but sometimes the translators are not available for me to consult because they have moved to Heaven or something. And sometimes I know the meaning of the markup, and it doesn't match anything nonpresentational in USFM or OSIS. Therefore I appreciate not having a sentence of death by crucifixion pronounced on me.

> Finally - we do try and maintain a very high standard in all modules we
> put into our own repo. But if you work with moving targets -
> translations which are still in flux and get constantly updated, I think
> it is more than fair enough to simply run through with a process and put
> stuff into a public repo.

Sounds good to me. Of course, I would still like feedback if something is amiss in such a module.

>  VERY FEW of the flaws I explained above will
> actually totally break a sword module. Teus Benschop of bibledit has his
> own fairly brute force conversion routine in bibledit. None of the
> modules he produces would pass muster for our repo. But they do just
> fine in the circumstances he requires them - for people to see the
> progress of the translation, for using in Xiphos which he ties against
> Bibledit during translation etc etc.

I've tried Bibledit's export. I'm not sure what you like and don't like about it. I haven't looked under the hood of Teus' source code, yet, to see what he does.

> I hope this clarifies things. Sorry for the long epistle.

Yes, it helps. Thank you for taking the time to write your long epistle. :-)

> I summarise - if you want to go all mechanical do the following:
>
> identify all missing tags in bulk and we/you can add them to
> usfm2osis.pl. Small job. Maybe also create a "toned down" version of
> usfm2osis.pl for you which reduces the risk of dual use USFM tags by
> marking stuff deliberately only presentational.

I'll see if I can enumerate what is missing for real. (I understand that sometimes comments and code get out of sync; no condemnation, just acknowledging the struggle.) I also have a framework for USFX->OSIS conversion already existing in C# that I'm more familiar with, so I may use that instead. (I'm fluent in C# but haven't touched Perl for a long time, and am a bit rusty in that language.) Either way, seeing what usfm2osis.pl produces on "easy" texts should be helpful.
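
Enumerating the missing tags in bulk could look something like the Python sketch below. It counts every opening marker in a USFM text and reports those absent from a "supported" set. The function names are made up, and the supported set shown is a placeholder for illustration, not the actual list handled by usfm2osis.pl; that list would have to be read from the script's source.

```python
import re
from collections import Counter

# A backslash, an optional "+" (nested character style), a lowercase name
# with optional trailing digits, and an optional "*" that marks a closer.
MARKER = re.compile(r"\\\+?([a-z]+[0-9]*)(\*?)")


def count_markers(usfm_text):
    """Count opening markers; closing markers such as \\nd* are skipped."""
    counts = Counter()
    for match in MARKER.finditer(usfm_text):
        if match.group(2) != "*":
            counts[match.group(1)] += 1
    return counts


def unsupported_markers(usfm_text, supported):
    """Return {marker: count} for markers that are not in the supported set."""
    counts = count_markers(usfm_text)
    return {name: n for name, n in counts.items() if name not in supported}


# Placeholder set for illustration only.
SUPPORTED = {"id", "c", "p", "v"}
text = "\\id GEN\n\\c 1\n\\p\n\\v 1 In the beginning \\nd God\\nd* created\n\\q1 a line"
print(unsupported_markers(text, SUPPORTED))
```

Run over a whole collection and summed, the resulting counts would show which missing tags matter most.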

And yes, deliberately marking things up presentationally where the translator has done so is an acceptable option. Honestly, the goal behind semantic vs. presentational markup, i.e., easy repurposing of texts for display and printing in different media and formats, is already accomplished without eradicating the last few presentational elements. Semantic markup need not be absolute. Indeed, HTML became a whole lot more useful when combined with CSS and the ability to specify fonts, etc. Specifying fonts can be very important in some cases, like the Cambodian Khmer Bible online. (I don't have a CC license on that one, but I do have permission to post it on eBible.org. It is also on BibleCambodia.org.)

> devise some filter to drop what needs dropping - letters in verse and
> chapter numbers are the biggest no-no. missing \id and \c markers the
> other one. Create a small add on to usfm2osis.pl which will fix the
> single chapter books where no chapter marker is used commonly. The rest
> will not break a module

By the time USFM files get to this point in my process, there will be no missing \id or \c markers, nor will there be strange characters in \c or \v. The \id lines will all have the proper 3-letter book identification. There will always be a \c 1 in one-chapter books. However, the list of markers that exist in the current USFM standard but aren't used is very short: primarily the study Bible items like sidebars and some peripheral material markers. Currently, I've also stripped out illustrations (because of copyright permission problems, rather than technical difficulties).
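
For anyone who wants to verify those guarantees before conversion, a pre-flight check might be sketched in Python as follows. This is a hypothetical helper, not part of any tool named here. It assumes one book per file with the \id line first, treats a three-character code of capitals and digits as valid, and accepts verse bridges such as 2-3.

```python
import re

ID_LINE = re.compile(r"^\\id ([A-Z0-9]{3})\b")
CHAPTER = re.compile(r"^\\c (\S+)")
VERSE = re.compile(r"\\v (\S+)")
VERSE_NUMBER = re.compile(r"[0-9]+(-[0-9]+)?")


def check_usfm(usfm_text):
    """Return a list of problems likely to break conversion (empty if none)."""
    problems = []
    lines = usfm_text.splitlines()
    if not lines or not ID_LINE.match(lines[0]):
        problems.append("missing or malformed \\id line")
    seen_chapter = False
    for n, line in enumerate(lines, 1):
        chapter = CHAPTER.match(line)
        if chapter:
            seen_chapter = True
            if not re.fullmatch(r"[0-9]+", chapter.group(1)):
                problems.append(f"line {n}: bad chapter number {chapter.group(1)!r}")
        for verse in VERSE.finditer(line):
            if not seen_chapter:
                problems.append(f"line {n}: \\v before any \\c marker")
            if not VERSE_NUMBER.fullmatch(verse.group(1)):
                problems.append(f"line {n}: bad verse number {verse.group(1)!r}")
    return problems


good = "\\id JUD\n\\c 1\n\\v 1 Jude, a servant\n\\v 2-3 Mercy and peace"
print(check_usfm(good))
```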

> Create a repo and let people use it.

I'll probably need a little help with the best way to do that. It should be simple enough, given that I have plenty of web and ftp server space.

> Lean back and relax in your tropical sunshine

It is nice, but sometimes I miss other places. If you move around enough as a missionary, you might find that you can be homesick anywhere. Next week I'll be out of the tropics and in Germany for a Scripture publishing summit and another related conference. Please pray that it goes well and that I please Father God in the process.

Thank you for your patience with me. :-)

Aloha,
Michael
http://MLJohnson.org