<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

  <title>Why Use OSIS When USFM and USFX Work Better?</title>

</head>

<body bgcolor="#ffffff" text="#000000">

<h1 class="western">Why Use OSIS When USFM and USFX Work Better?</h1>

<p style="margin-bottom: 0in;" align="right"><i><font

 style="font-size: 8pt;" size="1">By

(Kahunapule) Michael Johnson, </font></i><a

 href="http://kahunapule.org/"><i><font style="font-size: 8pt;" size="1">http://kahunapule.org</font></i></a></p>

<p style="margin-bottom: 0in;" align="right"><font

 style="font-size: 8pt;" size="1"><i>6

February 2006</i></font></p>

<h2 class="western">Conclusion</h2>

<p>OSIS is a poor choice for a standard Scripture archiving,

authoring, and interchange format for members of the Forum of Bible

Translation Agencies. Its inadequacies can be patched, but it

probably can't be made truly good without violating backward

compatibility constraints. Using OSIS is likely to make software

development efforts more costly and slower than necessary. OSIS is

not better than USFM, overall. I present a viable XML alternative,

below. It is likely that other options that are better than OSIS

exist or could be created. In the meantime, OSIS should be

considered experimental, and not used for production work like

drafting, checking, publishing, or archiving of Scripture unless USFM

equivalents are kept up to date and stored alongside the OSIS

texts.</p>

<h2 class="western">My Biases</h2>

<p>(If you don't know what the Open Scriptural Information Standard

is, you can stop reading, now, and ignore both that proposed standard

and this document.)</p>

<p>I have been asked to write about Scripture file formats and the

suitability of OSIS for use in the SIL PNG Branch and in EBT. Before

I begin, let me explain a little bit about my qualifications to

comment on OSIS, what I have done with OSIS, and why I have much more

than a passing interest in OSIS.</p>

<p>I have been interested in electronic publication and distribution

of the Holy Bible in various translations since long before I started

working on such things full time. While working a day job as a senior

software engineer, I would work on weekends and evenings on Bible

translation and electronic distribution. Part of the fruit of that is

a Public Domain modern English translation of the Holy Bible that is

distributed at <a href="http://WorldEnglishBible.org/">http://WorldEnglishBible.org</a>,

among other places. (I still do volunteer work on that project from

time to time.) Before I knew &#8220;Standard Format (SF)&#8221; existed and

before either XML or the World Wide Web were widely known, I thought

through the need for such a format for my own work, and generated a

Bible file format that is similar to SF, but differs in some details.

I still use that old format (GBF) to generate HTML, PDF, RTF, and

other formats. I learned about SF in volunteering for WA UK in

keyboarding Scriptures. Later, I joined EBT and (via secondment to

the PNG Branch) SIL. I have worked on Bible translation-related

software development, mostly, but I also manage the department of the

SIL PNG Branch responsible for Scripture typesetting and Scripture

archiving.</p>

<p>My interest and experience related to Scripture file formats are

more practical and experiential than theoretical, although I

certainly do apply information theory and best practices of software

design in my work. (My Master's Thesis was related to information

theory, specifically data compression and encryption.) I have written

software that reads and writes SF (and especially the UBS preferred

dialect of SF, USFM or Unified Standard Format Markup). I have also

studied and written software to handle XSEM and OSIS Scripture files.

In that process, I have gained some insights.</p>

<p>I monitor the progress of several open source Scripture-related

projects, as well as some of the SIL projects, although I concentrate

mostly on producing the tools I'm working on personally. Currently,

my main project is the Onyx Scripture Typesetting project. The idea

of that project, as well as its actual use, is simple, even though

the implementation was not. I provide a program that reads Unicode

USFM Scripture files and produces a Microsoft Word 2003 XML (WordML)

document that is essentially all typeset except for the front and

back matter, pictures, captions, and maps. Those can be inserted

manually using Microsoft Word's normal editing facilities. As an

added bonus, XML tags can be embedded in the WordML to allow a

reverse transformation back to USFM in some cases.</p>

<p>I started out being in favor of OSIS, and even tried to promote

it. I have since repented of that viewpoint, as it probably does more

harm than good by discouraging the development and use of much better

Scripture XML file schemas.</p>

<h2 class="western">XML Myths Debunked</h2>

<p>Myth 1: Anything in XML is inherently better for archiving and

processing than non-XML formats. <b>False.</b> XML is just a set of

rules defining how text files can define data, with tags, attributes,

and contents being easily separated and parsed. One disadvantage of

XML is that it forces strict nesting of elements, making it an

awkward basis for Biblical texts. (This shortcoming is easy to

overcome using milestones, which are empty elements that mark the

beginning or end of something. Unfortunately, some ways of

doing that are error-prone and inelegant, as is the case with OSIS.)

Just because something is in XML doesn't mean any software that can

read XML can make sense out of it. It all depends on the schema used.

XML can, in fact, be made arbitrarily obscure and arcane,

intentionally or otherwise. It can be encrypted, scrambled, and made

to conform to illogical structures. XML data can be arranged in

useful or non-useful ways, or any combination thereof.</p>

<p>Myth 2: XML is better than SF for processing because of the

software tools available for processing XML. <b>False.</b> There are

some good tools available for processing XML and transforming it to

other formats, but there are also pretty simple SF parsers available.

Writing an SF parser is actually simpler than writing an XML parser.</p>
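<p>To give a feel for how little code a basic SF reader needs, here is a
minimal sketch in Python. It is an illustration only, not the parser used in
Onyx or Paratext. The function name <tt>tokenize_sf</tt> and the marker
pattern are assumptions made for this example, and the pattern covers only
simple markers such as \id, \c, \v, \q1, \nd, and \nd*. It also trims the
whitespace around each piece of text, which a production parser would have to
treat more carefully.</p>

<pre>
```python
import re

# One backslash marker: letters, optional digits, optional closing asterisk.
# Assumption: this covers only simple USFM-style markers, not every dialect.
MARKER = re.compile(r"\\([a-z]+[0-9]*\*?)")


def tokenize_sf(text):
    """Split standard-format text into (marker, content) pairs.

    Text before the first marker is reported with marker None.
    """
    tokens = []
    position = 0
    marker = None
    for match in MARKER.finditer(text):
        content = text[position:match.start()].strip()
        if marker is not None or content:
            tokens.append((marker, content))
        marker = match.group(1)
        position = match.end()
    tail = text[position:].strip()
    if marker is not None or tail:
        tokens.append((marker, tail))
    return tokens


sample = "\\id GEN\n\\c 1\n\\p\n\\v 1 In the beginning \\nd God\\nd* created the heavens."
for marker, content in tokenize_sf(sample):
    print(marker, repr(content))
```
</pre>

<p>Each pair holds the marker name and the text that follows it up to the
next marker, which is already enough for many checking and conversion
tasks.</p>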

<p>Myth 3: If data is expressed in XML, it can easily be transformed

to other formats. <b>False.</b> The data can only be transformed to

other formats if all required information for the target format is

present in the source format, and segregated with the same

granularity. Furthermore, the programming skills necessary to perform

these transformations are specialized knowledge that it is not

reasonable to expect the average computer user to be fluent in. (The

&#8220;average&#8221; computer user is probably challenged to understand a

tree-structured directory file system, let alone XSLT.)</p>

<p style="margin-bottom: 0in;">That said, I like XML, and I like to

use some of the software tools available to read, parse, transform,

and write XML. However, the suitability of XML to a given task

depends strongly on the schema and application.</p>

<h2 class="western">Why I Like USFM</h2>

<ol>

  <li>

    <p style="margin-bottom: 0in;">It is simple to understand, use, and

program for. It is simple enough that at least 50% of ordinary

working linguists (OWLs) can be expected to understand and edit it, even in a

plain text editor, at least when using its commonly used features.</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">It is well documented, and the

documentation is maintained and published in accessible formats (HTML

and PDF in a way that is easy to mirror on a notebook computer taken to

a remote village). The latest documentation is easy to find and clearly

labeled with its revision date.</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">The maintainers of USFM are

responsive to comments and mindful of backward compatibility issues

when they make changes.</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">USFM is close enough to the

(deprecated-but-still-used) PNG SFM that updating to USFM is

reasonably painless. (In most cases, just a few global

search-and-replace operations do it.)</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">USFM is well-enough defined that it

makes programming tools to read and write USFM easier to create and

maintain than doing the same for generic SFM.</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">USFM provides a real and practical

measure of cross-entity portability for Scripture texts, opening up

more options for typesetting, checking, and software tool creation and

use.</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">USFM takes full advantage of the

time-tested practical aspects of SFM in the experience of Bible

translators from multiple organizations, making incremental

improvements where appropriate.</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">USFM is a simple text-based,

easy-to-parse format that is robust, can be read by many software

tools, and will not go obsolete due to the obsolescence of any one

software tool or company. It is trustworthy for archiving purposes.</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">USFM allows the unambiguous encoding

of all essential elements of Scripture texts that I'm interested in

encoding, including every PNG language, and for that matter, the

Scriptures and essential peripherals (footnotes, section titles, etc.)

for any language I anticipate encountering.</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">In the unlikely event that USFM

would be inadequate for a particular language or translation, it would

not be difficult to extend it for whatever unusual circumstances might

come up.</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">USFM has good software support with

Paratext, various Microsoft Word macros, Adapt It, Onyx, and various

other programs. Future support is being developed in the JAARS

Translation Editor.</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">USFM is simple enough to program for

that it can be used with low power computing devices.</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">USFM does nothing to force manual

processing of legacy data to make it conform to current standards.

Automated conversions from other dialects of SF are possible.</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">By using USFM instead of PNG SFM, we

can take advantage of new releases of Paratext style sheets, etc.,

without having to customize them for our own dialect of SF (like we

used to do with PNG SFM).</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">There is no problem encoding any of

the common variants in versification.</p>

  </li>

  <li>

    <p style="margin-bottom: 0in;">USFM is mostly backward compatible

with prior SF dialects, separating data with the same granularity. In

most cases, updating to USFM is a simple matter of a few consistent

changes of markers. In the unlikely event you would want to do the

reverse transformation, that is easy, too.</p>

  </li>

</ol>

<h2 class="western">What I Don't Like About USFM</h2>

<ol>

  <li>

    <p>The current version of USFM as I read it and as implemented in

Paratext is ambiguous with respect to the end point of character styles

in some cases. I have given the nitty-gritty details to the interested

parties, and expect a wise resolution. In the meantime, I have a

work-around involving USFX for those cases where I need it.</p>

  </li>

  <li>

    <p>USFM is not XML, so it can't be used where XML is required, such

as direct embedding in WordML or as input for an XSL Transformation. My

work-around for this is a direct conversion of USFM to XML using the

USFX schema, which has a very simple mapping of XML elements to

backslash codes in USFM, and which can represent all the same data with

no loss of information in the conversion from USFM to USFX. USFX is

documented at <a href="http://ebt.cx/usfx/">http://ebt.cx/usfx/</a>.</p>

  </li>

  <li>

    <p>USFM does not support footnote range start tags for easy

hyperlink generation, but most SIL members would never miss this

function.</p>

  </li>

</ol>
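<p>The idea of mapping backslash codes directly to XML elements can be
sketched in a few lines of Python. This is a simplified illustration under
stated assumptions, not the actual USFX schema or converter: the element
names simply reuse the marker names, the root element name <tt>book</tt> and
the <tt>id</tt> attribute on verses are hypothetical, only a small subset of
markers is treated as paragraph markers, and text before the first marker is
ignored. Verses are written as empty milestone elements so that they never
have to nest inside or around paragraphs.</p>

<pre>
```python
import re
import xml.etree.ElementTree as ET

MARKER = re.compile(r"\\([a-z]+[0-9]*)(\*?)")
PARAGRAPH_MARKERS = {"p", "q1", "q2", "s"}  # illustrative subset only


def append_text(parent, text):
    """Append text to parent, after its last child if it has one."""
    if not text:
        return
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def usfm_to_xml(usfm):
    """Convert a USFM fragment to an element tree (hypothetical names)."""
    root = ET.Element("book")
    paragraph = root  # element that receives verses and character styles
    # split() yields [leading text, name, star, text, name, star, text, ...]
    pieces = MARKER.split(usfm)
    for i in range(1, len(pieces), 3):
        name, star, text = pieces[i], pieces[i + 1], pieces[i + 2]
        text = text.replace("\n", " ")
        if star:
            # Closing marker such as \nd*: text resumes in the paragraph.
            append_text(paragraph, text)
        elif name in PARAGRAPH_MARKERS:
            paragraph = ET.SubElement(root, name)
            append_text(paragraph, text.lstrip())
        elif name == "v":
            # Verse milestone: an empty element carrying the verse number.
            number, _, rest = text.lstrip().partition(" ")
            ET.SubElement(paragraph, "v", id=number)
            append_text(paragraph, rest)
        else:
            # Character style such as \nd: a nested element with its text.
            style = ET.SubElement(paragraph, name)
            append_text(style, text.lstrip())
    return root


sample = "\\p\n\\v 1 In the beginning \\nd God\\nd* created.\n\\v 2 The earth was empty."
print(ET.tostring(usfm_to_xml(sample), encoding="unicode"))
```
</pre>

<p>Because every element corresponds to exactly one marker, the reverse
conversion is a matter of walking the tree and writing each element name back
out as a backslash code.</p>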

<h2 class="western">What I Like About OSIS</h2>

<ol>

  <li>

    <p>It is XML.</p>

  </li>

  <li>

    <p>It seems to have at least theoretical support by a wide

representation of interested parties, and seems to have some capable

salesmen working to establish it as a standard.</p>

  </li>

  <li>

    <p>USFM data can be converted to OSIS automatically if you accept

some modifications to the OSIS documented standard, and if you don't

mind adding some metadata from other sources. It is a little awkward,

and may involve loss of some metadata, but it is possible.</p>

  </li>

  <li>

    <p>OSIS documents can be converted to USFM if you can accept some

potential loss of data, in the cases where either the quotation

punctuation rules are simple or where the generator of that text

modified OSIS to make lossless conversion possible.</p>

  </li>

  <li>

    <p>It allows drafting of Bible texts marking only the beginning and

end of quotations, without having to manually adjust punctuation for

nesting level and open quote reminders at stanza and paragraph breaks

when appropriate for a particular language and style, on the promise that

some process will later supply the actual punctuation.</p>

  </li>

</ol>
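<p>To make the last point concrete, here is a minimal sketch of what "some
process" that supplies quotation punctuation has to do. The two style tables
and the event format are hypothetical, invented for this example, and are not
part of OSIS, USFM, or USFX. The sketch ignores continuation marks at
paragraph and stanza breaks. Even so, the same markup yields different
punctuation depending on a style table that has to come from somewhere
outside the marked-up text.</p>

<pre>
```python
# Hypothetical style tables: (opening, closing) marks per nesting level.
# They are illustrative only and not taken from any standard.
STYLES = {
    "us-english": [("\u201c", "\u201d"), ("\u2018", "\u2019")],
    "guillemets": [("\u00ab", "\u00bb"), ("\u201c", "\u201d")],
}


def render_quotes(events, style):
    """Turn ("open",), ("close",) and ("text", s) events into a string."""
    marks = STYLES[style]
    depth = 0
    output = []
    for event in events:
        kind = event[0]
        if kind == "open":
            output.append(marks[depth % len(marks)][0])
            depth += 1
        elif kind == "close":
            depth -= 1
            output.append(marks[depth % len(marks)][1])
        else:
            output.append(event[1])
    return "".join(output)


events = [
    ("text", "He said, "),
    ("open",),
    ("text", "It is written, "),
    ("open",),
    ("text", "Be holy"),
    ("close",),
    ("text", "."),
    ("close",),
]
for style in STYLES:
    print(style, render_quotes(events, style))
```
</pre>

<p>A reader that lacks the right table for a given translation has no way to
produce the punctuation the translators intended.</p>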

<h2 class="western">What I Don't Like About OSIS</h2>

<ol>

  <li>

    <p>The quotation and speech markup is incomplete with respect to

multiple languages and styles, making it impossible to be sure that

OSIS readers would generate and display the correct quotation

punctuation for a given translation without extra external information.

OSIS does not define or provide a way of providing that extra

information, nor is it obvious how that information should be supplied.

Therefore, OSIS files are not self-contained with respect to all

important Scripture meaning-based data like USFM is.</p>

  </li>

  <li>

    <p>The latest documentation I read on OSIS indicated that it was

improper to put quotation punctuation directly in the text, instead

requiring it to be converted to markup, a process that is difficult,

if not impossible, to do automatically, especially without detailed

language-specific information.</p>

  </li>

  <li>

    <p>OSIS Scripture files are not self-contained with respect to all

of the meaning-based markup of the text, unlike USFM, except in some

simple cases.</p>

  </li>

  <li>

    <p>USFM and legacy SF texts cannot be fully automatically converted

to fully conformant OSIS with respect to quotation handling without

some serious manual intervention or language-specific programming.</p>

  </li>

  <li>

    <p>OSIS has no mechanism for encoding &#8220;red letter&#8221; editions of

Bibles other than &lt;q&gt; tags, and those could be interpreted by

OSIS readers to mean that punctuation should be inserted, even if the

target language and style forbids such insertion.</p>

  </li>

  <li>

    <p>OSIS takes the control of quotation punctuation out of the hands

of the translators and gives it to the programmers who write the

programs that interpret the OSIS.</p>

  </li>

  <li>

    <p>OSIS does not support footnote range start tags for easy

hyperlink generation.</p>

  </li>

  <li>

    <p>Handling of minor variations in versification is awkward in

OSIS. Older attempts at documenting OSIS made a stab at handling this,

but currently published documentation doesn't even address this issue.</p>

  </li>

  <li>

    <p>OSIS parsing is unnecessarily complex mostly due to the fact

that it does not handle the overlapping of book/chapter/verse,

quotations, and book/section/paragraph or stanza/verse/line hierarchies

of Scripture texts well. It really has multiple ways of handling these,

and OSIS readers have to deal with all of them, adding unnecessary

complexity.</p>

  </li>

  <li>

    <p>Start/end tag matching identifiers are used where they really

wouldn't be required, and add unnecessary complexity to OSIS

generation. This isn't a big deal for program-generated OSIS, but it is

probably enough all by itself to push the complexity past what most

OWLs can handle error-free for manual OSIS generation with a text

editor.</p>

  </li>

  <li>

    <p>There is a fair amount of ambiguity in the OSIS standard,

leading to doubts about reliable compatibility between different

software products using OSIS to interchange data.</p>

  </li>

  <li>

    <p>The current OSIS standard is not easy to find on the OSIS web

site, and the documentation that is there is out of date.</p>

  </li>

  <li>

    <p>OSIS has inadequate software support for drafting, checking, and

publishing Scriptures.</p>

  </li>

  <li>

    <p>I have yet to see reliable converters between OSIS and USFM. (I

have written an OSIS writer myself, but it was impossible to complete

without &#8220;cheating&#8221; on the OSIS standard a little, making modifications

that the OSIS committee seems to be unwilling to make.)</p>

  </li>

  <li>

    <p>The unnecessary complexity of OSIS means that software written

to read and write it will be more expensive, take longer to develop, and

probably contain more bugs than software written to a simpler standard,

even though a simpler standard could do anything OSIS could do.</p>

  </li>

  <li>

    <p>The OSIS schema I programmed against when testing its

suitability could not handle simple things like supplied text (KJV

italics) within a Psalm title.</p>

  </li>

  <li>

    <p>OSIS is too complex to embed in WordML along with working

typeset text.</p>

  </li>

  <li>

    <p>OSIS could be made usable with some minor modifications, but

there is no indication that those modifications would ever be made.</p>

  </li>

  <li>

    <p>OSIS could never be made simple enough to be elegant and to save

on software development costs without sacrificing backward

compatibility. To really fix it, it would be better to replace it and

provide a conversion tool for legacy text. This, in turn, raises doubts

about OSIS' suitability as an archival format.</p>

  </li>

  <li>

    <p>In an environment where there has been a large perceived need

for an XML Scripture file interchange standard, OSIS has been around

for a very long time (in Internet years) without producing a

significant following among software developers or Bible translators.

There are a couple of notable exceptions (like The Sword Project), but

even there, I think adopting OSIS significantly slowed development on that

project.</p>

  </li>

  <li>

    <p>The mere thought that OSIS would be useful to us in the field

with the current set of support tools is laughable due to the overly

complex nature of that schema. OSIS is too complex for competent

programmers to fully grasp, let alone my typesetting staff. Defining a

&#8220;best practices&#8221; subset of OSIS is not sufficient to fix this problem.</p>

  </li>

  <li>

    <p>I find some of the tools provided so far for OSIS editing to be

intimidating from a security and usability standpoint. For example, I'm

not willing to even test the OSIS editor Word 2003 plugin on a

production machine because of the way it uses macros.</p>

  </li>

  <li>

    <p>Given all of the above, I consider OSIS to be dangerous, in that

it is consuming resources better applied elsewhere and discouraging

people from looking at alternatives.</p>

  </li>

</ol>
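<p>The bookkeeping behind the start/end matching identifiers mentioned in the
list above can be sketched as follows. This is a minimal illustration,
assuming milestone elements that carry <tt>sID</tt> and <tt>eID</tt>
attributes in the OSIS style. The fragment is built in code, omits the OSIS
namespace, and is not validated against any OSIS schema. The function name
<tt>unmatched_milestones</tt> is hypothetical.</p>

<pre>
```python
import xml.etree.ElementTree as ET


def unmatched_milestones(root):
    """Return (starts never closed, ends never opened) as (tag, id) lists."""
    open_ids = []
    stray_ends = []
    for element in root.iter():  # document order
        start = element.get("sID")
        end = element.get("eID")
        if start is not None:
            open_ids.append((element.tag, start))
        if end is not None:
            if (element.tag, end) in open_ids:
                open_ids.remove((element.tag, end))
            else:
                stray_ends.append((element.tag, end))
    return open_ids, stray_ends


# A quotation that starts in one paragraph and ends in the next, so it
# cannot be written as one normally nested element.
root = ET.Element("div")
first = ET.SubElement(root, "p")
ET.SubElement(first, "q", sID="q1").tail = "Let there be light,"
second = ET.SubElement(root, "p")
second.text = "and there was light."
ET.SubElement(second, "q", eID="q1")
# A second quotation whose end milestone was forgotten.
ET.SubElement(second, "q", sID="q2").tail = "An unclosed quotation"

print(unmatched_milestones(root))
```
</pre>

<p>A program can keep such identifiers consistent without trouble. Someone
typing them by hand in a text editor has to do the same matching in their
head for every quotation, verse, and chapter.</p>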

<h2 class="western">What I Like About USFX</h2>

<ol>

  <li>

    <p>All of the good things I said about USFM apply, because it is

essentially USFM converted straight to XML, and because XML is also a

simple text format.</p>

  </li>

  <li>

    <p>USFX is XML, so software tools like XSLT and XML parsing library

functions work with it.</p>

  </li>

  <li>

    <p>USFX is simple enough to embed in WordML.</p>

  </li>

  <li>

    <p>USFX extends USFM to allow generation of quotation punctuation

from markup, but does so in a way that keeps control of that

punctuation in the hands of the translators, not programmers who

don't know the language. It also provides a mechanism missing in OSIS

to allow Scripture file parsers that don't know the rules for generating

punctuation for a particular translation to just leave in place what

has already been generated. USFX readers with such knowledge also know

what generated punctuation to replace, place, or remove when

reprocessing an edited input file. (OSIS has no way to do that, at

least not without nonstandard extensions.)</p>

  </li>

  <li>

    <p>USFX extends USFM to get rid of the character style end

ambiguity of USFM.</p>

  </li>

  <li>

    <p>USFX could readily be extended to include elements of OSIS that

are not in USFM, like full Dublin Core metadata, if anyone cared to

make that an option. Alternatively, you could make a document with

mixed schemas, and just use DC + USFX.</p>

  </li>

  <li>

    <p>USFX is easy to convert to or from USFM with no loss of

information. Converters exist that work on Windows XP, Mac OS X, and

Linux. Even though USFX has virtually no following yet, it can always be

converted to USFM, which is the most conservative, safe format to use for

Scripture processing, interchange, and archiving. USFX therefore also

inherits USFM's ease of conversion from legacy SF texts.</p>

  </li>

</ol>

<h2 class="western">What I Don't Like About USFX</h2>

<ol>

  <li>

    <p>I invented it, first as an internal XML schema to be used within

the Onyx project, so people might think I am just rejecting OSIS for

NIH reasons or rabble rousing. (Actually, I tried to use OSIS directly

first, but that turned out to be a time sink for many reasons, some of

which are listed above.)</p>

  </li>

  <li>

    <p>USFX hasn't been subjected to much use, shake-out, and comment,

yet, so it is probably still shaky as an archiving format. Therefore,

if USFX is used, it should also be converted to USFM, and the USFM copy

stored along with it for archiving.</p>

  </li>

</ol>

<h2 class="western">Other Bible XML Schemas</h2>

<p>Some other options exist. Some, I have looked at. Some, I have

not. How sure do you want to be about the set of railroad tracks you

lift your locomotive onto?</p>

<h2 class="western">What I Recommend</h2>

<ol>

  <li>

    <p>Don't legislate OSIS as THE XML Scripture standard to use within

SIL or any of the Forum of Bible Translation Agencies. Please. It isn't

like we could actually use it, right now, anyway, because of its flaws

and lack of adequate software tool support.</p>

  </li>

  <li>

    <p>Look for better alternatives in XML Scripture encoding schemas.

Consider USFX, or better yet, improve on USFX or replace it with

something better. Don't waste more time on OSIS, except to study what

good ideas from it might be transferred to a better schema.</p>

  </li>

  <li>

    <p>Do not abandon USFM as a Scripture drafting, processing,

typesetting, and archiving standard until you have something better,

and then only if it is easy to fully automatically convert from USFM to

that new standard.</p>

  </li>

  <li>

    <p>Convert any Scripture texts that have been produced or archived

in OSIS back to something better, like USFM.</p>

  </li>

</ol>

<h2 class="western">Further Reading</h2>

<p><a href="http://ebt.cx/usfx/Bible-encoding.htm">http://ebt.cx/usfx/Bible-encoding.htm</a></p>

<p><a href="http://ebt.cx/usfx/">http://ebt.cx/usfx/</a></p>

</body>

</html>