[sword-devel] Food for thought regarding OSIS and some of its alternatives...

Tue Feb 7 22:13:03 MST 2006

Kahunapule Michael Johnson wrote:
> 
>   Why Use OSIS When USFM and USFX Work Better?

Because OSIS is an open standard, available to all and open to the input 
and contributions of all; USFM and USFX are not. Oh yeah, not to mention 
OSIS is in most ways superior (i.e. works better than USFM or USFX) 
unless your primary interest is in working with existing tool sets (e.g. 
Paratext and typesetting tools), in which case you should probably use 
USFM. There exists no valid reason to use USFX.

>     Conclusion

A conclusion, by definition, goes at the end of a document. You might 
mean something like "abstract", though you don't present what is 
typically considered an abstract here. Or you might mean "conclusions", 
which, at least, can be read metaphorically since you do present 
multiple summary points. ("Conclusion" cannot unless you present a 
single point of summary, which you do not.)

> (If you don't know what the Open Scriptural Information Standard is, you 
> can stop reading, now, and ignore both that proposed standard and this 
> document.)

OSIS isn't a proposed standard. It's a real, accepted standard offered 
by its standards body. You might want to consult ISO or the IETF on the 
distinction between proposed and accepted (ratified) standards.

>     XML Myths Debunked
> 
> Myth 1: Anything in XML is inherently better for archiving and 
> processing than non-XML formats. *False.* XML is just a set of rules 
> defining how text files can define data, with tags, attributes, and 
> contents being easily separated and parsed. One disadvantage of XML is 
> that it forces strict nesting of elements, making it an awkward basis 
> for Biblical texts. (This shortcoming is easy to overcome using 
> milestones, which are empty elements that mark the beginning or end of 
> something. Unfortunately, there are some ways of doing that which are 
> error-prone and not elegant, like OSIS does.)

Sorry? Did you have a better suggestion? Perhaps you misread the 
documentation or don't understand basic Bible structure.

Anything in XML probably IS better for archiving and processing than 
non-XML formats--not inherently, but by virtue of the myriad tools 
available for processing XML and by virtue of the faults that are 
frequently present in non-XML formats. What XML does is provide 
standardization for interoperability, which IS better for archiving and 
processing.

> Myth 2: XML is better than SF for processing because of the software 
> tools available for processing XML. *False.* There are some good tools 
> available for processing XML and transforming it to other formats, but 
> there are also pretty simple SF parsers available, too. Implementing the 
> latter is actually simpler than the former.

You are being dishonest. SF has inherent structural ambiguities. USFM 
fixes these for the most part, so someone with a process around (U)SFM 
should probably keep that until they are ready to transition to OSIS--at 
which time they can use the USFM to OSIS converters for fairly painless 
transition.

> Myth 3: If data is expressed in XML, it can easily be transformed to 
> other formats. *False.* The data can only be transformed to other 
> formats if all required information for the target format is present in 
> the source format, and segregated with the same granularity. 
> Furthermore, the programming skills necessary to perform these 
> transformations are specialized knowledge that it is not reasonable to 
> expect the average computer user to be fluent in. (The “average” 
> computer user is probably challenged to understand a tree-structured 
> directory file system, let alone XSLT.)

Your statement that this is "False" is both incorrect and intentionally 
misleading. The fact that you can't magically invent extra data that 
never existed via an XSLT doesn't mean that data in XML cannot easily be 
transformed using the data that DOES exist. Easiness is obviously 
relative. I expect that there are many more people in the world who are 
capable of writing an XSLT than a CC table. Not to mention, books and 
copious online material exist to lead a person through the former, but 
not the latter. Thus, if the question is whether XML is easier to 
transform to other formats than (U)SFM/GBF, the answer is yes.

>     Why I Like USFM
> 
>    1.
>       It is simple to understand, use, and program for. It is simple
>       enough to expect at least 50% of ordinary working linguists (OWLs)
>       to be able to understand and edit even in a plain text editor, at
>       least with the commonly-used features of it.

*False* USFM is at least as simple to misunderstand and misuse as it is 
to understand and use. And it's relatively difficult to program for. Not 
to mention, ALL parsing has to be done by the application since 
pre-existing parsers don't exist for USFM, whereas they do for XML 
(libxml, MSXML, Xerces, etc.).
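
To illustrate the point about pre-existing parsers, here is a minimal sketch using the XML parser in Python's standard library. The fragment is a simplified, namespace-free stand-in for OSIS (a real document also carries a header and namespace declaration), and the function name `verses` is only an illustration.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for an OSIS fragment: no namespace, no header.
SAMPLE = """<div type="book" osisID="Gen">
  <chapter osisID="Gen.1">
    <verse osisID="Gen.1.1">In the beginning God created the heaven and the earth.</verse>
    <verse osisID="Gen.1.2">And the earth was without form, and void.</verse>
  </chapter>
</div>"""


def verses(xml_text):
    """Return {osisID: text} for every verse container in the fragment."""
    root = ET.fromstring(xml_text)  # all tokenizing is done by the library
    return {
        v.get("osisID"): "".join(v.itertext()).strip()
        for v in root.iter("verse")
    }


if __name__ == "__main__":
    for osis_id, text in verses(SAMPLE).items():
        print(osis_id, text)
```

The application code only walks an already-parsed tree; none of it deals with tokenizing the markup.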

If you're going to limit yourself to "commonly-used features" then 
OSIS/XML fits the same bill of being usable by "50% of ordinary working 
linguists" in a plain text editor. Increase that percentage 
significantly if they're allowed to use an XML editor, an OSIS-specific 
editor, or a plug-in for their favorite word processor.

>    2.
>       It is well documented, and the documentation is maintained and
>       published in accessible formats (HTML and PDF in a way that is
>       easy to mirror on a notebook computer taken to a remote village).
>       The latest documentation is easy to find and clearly labeled with
>       its revision date.

*True* The same goes, more or less, for OSIS. I know the manual is still 
in progress, but a new version was posted a few months ago. Basic 
"commonly-used features" have certainly been covered for a while.

>    3.
>       The maintainers of USFM are responsive to comments and mindful of
>       backward compatibility issues when they make changes.

*LOL* Awww. I feel like someone may still have his panties in a bunch 
even after he was given a solution for encoding presentation form of 
quotation marks in OSIS.

>    4.
>       USFM is close enough to the (depreciated-but-still-used) PNG SFM
>       that updating to USFM is reasonably painless. (In most cases, just
>       a few global search-and-replace operations do it.)

*Cool* And then converting to OSIS is as easy as running one of the USFM 
to OSIS converters. Or someone could modify my OSIS to USFM converter, 
which is covered by the BSD license, to convert directly between PNG SFM 
and OSIS.

>    5. 
>       USFM is well-enough defined that it makes programming tools to
>       read and write USFM easier to create and maintain than doing the
>       same for generic SFM.

*True* Definitely. SFM stinks compared to USFM. :)

>    6.
>       USFM provides a real and practical measure of cross-entity
>       portability for Scripture texts, opening up more options for
>       typesetting, checking, and software tool creation and use.

*True* Not to mention, quick forward transition to OSIS using the USFM 
to OSIS converters.

>    7.
>       USFM takes full advantage of the time-tested practical aspects of
>       SFM in the experience of Bible translators from multiple
>       organizations, making incremental improvements where appropriate.

*True* Incremental improvements are okay, provided they come from the 
USFM maintainers. My experience (and something I was actually taught to 
do in a class where I was taught SFM) is that people who use SFM tend to 
chart new territory at the drop of a hat (i.e. make up new tags). XML 
schema validation goes a long way towards preventing that, but I'm sure 
something similar exists for preventing use of non-standard tags in 
USFM. But then, you'll still have to convince all those people who've 
been taught to make up new tags that they need to stop doing that, even 
though they're using the same old encoding system.
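
As a rough sketch of what "something similar" could look like on the SFM side, the check below flags any backslash marker that is not on an agreed list. The `ALLOWED` set is a tiny illustrative subset, not the real USFM inventory, and the marker `\mytag` is made up for the example.

```python
import re

# Tiny illustrative subset of markers; NOT the full USFM inventory.
ALLOWED = {"id", "c", "v", "p", "s", "q", "f", "ft", "fr"}

# A backslash, a lowercase marker name, an optional level digit, an optional "*".
MARKER = re.compile(r"\\([a-z]+)(\d*)\*?")

SAMPLE = r"""\id GEN
\c 1
\p
\v 1 In the beginning \mytag God\mytag* created
\q1 a poetic line"""


def unknown_markers(text, allowed=ALLOWED):
    """Return (line_number, marker) pairs for markers outside the allowed set."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in MARKER.finditer(line):
            if match.group(1) not in allowed:
                problems.append((lineno, match.group(1) + match.group(2)))
    return problems


if __name__ == "__main__":
    for lineno, marker in unknown_markers(SAMPLE):
        print(f"line {lineno}: unknown marker \\{marker}")
```

A check like this only catches invented tags; it does nothing about markers that are on the list but used in the wrong place, which is the part schema validation also covers.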

>    8.
>       USFM is a simple text-based, easy-to-parse format that is robust,
>       can be read by many software tools, and will not go obsolete due
>       to the obsolescence of any one software tool or company. It is
>       trustworthy for archiving purposes.

*True* of USFM and (to a lesser extent with respect to software tools) 
OSIS, but not of any of the other formats you consider.

>    9.
>       USFM allows the unambiguous encoding of all essential elements of
>       Scripture texts that I'm interested in encoding, including every
>       PNG language, and for that matter, the Scriptures and essential
>       peripherals (footnotes, section titles, etc.) for any language I
>       anticipate encountering.

*Wow* suddenly it became /all about you, all about you, Michael/. Well, 
I can think of some fairly fancy things I care about encoding that USFM 
can't handle (half-lines, present in Sanskrit and early Germanic 
poetry--but maybe I only thought of that because it came up in a class 
today). OSIS handles that. I don't think USFM can handle much of the 
word-level markup we use in Sword (which we can encode because we use 
OSIS, which does handle it). Not only can OSIS handle encoding of 
everything I'm INTERESTED in encoding, I think it can handle encoding of 
any scripture that I can imagine anyone wanting to encode.

>   10.
>       In the unlikely event that USFM would be inadequate for a
>       particular language or translation, it would not be difficult to
>       extend it for whatever unusual circumstances might come up.

*True*, but this is only good if the maintaining body does the 
extension. Same is true of OSIS.

>   11.
>       USFM has good software support with Paratext, various Microsoft
>       Word macros, Adapt It, Onyx, and various other programs. Future
>       support is being developed in the JAARS Translation Editor.

*True* of USFM and (to a lesser extent) OSIS. There is some overlap in 
the inventory of specific programs supporting USFM & OSIS.

>   12.
>       USFM is simple enough to program for that it can be used with low
>       power computing devices.

*Huh?* I don't see the relation between power and programming 
complexity. Maybe this is a criticism of memory intensiveness of the 
DOM; in which case, don't use DOM.

>   15.
>       There is no problem encoding any of the common variants in
>       versification.

*True* -- nor is there a problem with OSIS. And you can encode multiple 
versifications simultaneously within a single document to permit 
single-source generation of customized Bibles in different versifications.

>     What I Don't Like About USFM
> 
>    3.
>       USFM does not support footnote range start tags for easy hyperlink
>       generation, but most SIL members would never miss this function.

Truly I don't know that anyone would ever use this. The only place I've 
ever seen it explicitly encoded in a format is in GBF. It's fairly easy 
to automate markup of this in OSIS, provided you actually have the data 
for where to start/end footnote marking.

>     What I Like About OSIS
> 
>    3. 
>       USFM data can be converted to OSIS automatically if you accept
>       some modifications to the OSIS documented standard, and if you
>       don't mind adding some metadata from other sources. It is a little
>       awkward, and may involve loss of some metadata, but it is possible.

In my experience, USFM->OSIS is a lossless conversion. I could be wrong 
here, but I don't remember losing any data, no matter how minute, in the 
conversions I've done. And those were all automatable.

>    4.
>       OSIS documents can be converted to USFM if you can accept some
>       potential loss of data, in the cases where either the quotation
>       punctuation rules are simple or where the generator of that text
>       modified OSIS to make lossless conversion possible.

OSIS->USFM is potentially lossy. We ensured that every marker in USFM 
has some correlate in OSIS. But, as I mentioned, USFM doesn't have a 
marker for all OSIS elements, attributes, and types.

Your mention of quotation marks is just a red herring.

>     What I Don't Like About OSIS
> 
>    1.
>       The quotation and speech markup is incomplete with respect to
>       multiple languages and styles, making it impossible to be sure
>       that OSIS readers would generate and display the correct quotation
>       punctuation for a given translation without extra external
>       information. OSIS does not define or provide a way of providing
>       that extra information, nor is it obvious how that information
>       should be supplied. Therefore, OSIS files are not self-contained
>       with respect to all important Scripture meaning-based data like
>       USFM is.

Your mention of quotation marks is just a red herring.

You've had the use of the n attribute on q elements explained to you. 
You could have disseminated that knowledge further, but instead chose to 
ignore it and lie about a non-existent deficiency in OSIS.

And, as Troy mentions, we will probably make quotation marking even 
simpler via a defaulting mechanism definable in the header.

>    2.
>       The latest documentation I read on OSIS indicated that it was
>       improper to put quotation punctuation directly in the text,
>       instead requiring it to be converted to markup-- a process that is
>       difficult, if not impossible to do automatically, especially
>       without detailed language-specific information.

*Mostly False* The OSIS Manual (2.1 draft, Appendix K, Conformance 
Requirements) doesn't actually specify anything about marking quotations 
with <q> vs. ". If it does, in a later version, I would expect to see 
that conformance requirement appearing at either level 2 or 3. (I would 
say it is at least implicit in the level 3 requirement.) So, while it 
may be arguably "improper to put quotation punctuation directly in the 
text", it is certainly not a conformance requirement, meaning that 
documents which fail to do so are poor OSIS, but OSIS nonetheless.

>    3.
>       OSIS Scripture files are not self-contained with respect to all of
>       the meaning-based markup of the text, unlike USFM, except in some
>       simple cases.

*False* There are plans to permit including portions of external 
documents in an OSIS document, but that isn't really a top priority at 
the moment.

>    4.
>       USFM and legacy SF texts cannot be fully automatically converted
>       to fully conformant OSIS with respect to quotation handling
>       without some serious manual intervention or language-specific
>       programming.

*Grossly Misleading* "Full" conformance would be level 4: Scholarly OSIS 
document. That requires tagging all significant names as well as 
including some sort of scholarly apparatus (even as simple as Strong's 
numbers or extensive translation notes). Of course that can't be 
achieved in most USFM to OSIS conversions because the data simply isn't 
present (often isn't even encodable) in the USFM basis.

Since this complaint is just another rephrasing of your complaints about 
in-text quotation punctuation, I'll reiterate that a document which 
leaves quotation marks as literal " characters instead of marking them 
with <q> still qualifies for level 1 conformance, if not level 2.

>    5.
>       OSIS has no mechanism for encoding “red letter” editions of Bibles
>       other than <q> tags, and those could be interpreted by OSIS
>       readers to mean that punctuation should be inserted, even if the
>       target language and style forbids such insertion.

*Grossly Misleading* This is actually explicitly addressed in the 
manual. Go read it.

This complaint is as valid as complaining that OSIS does nothing to 
ensure that completely daft XSLT writers don't convert <q who="Jesus"> 
to HTML <blink>. We can't prevent you from writing bad stylesheets. If 
you're checking for the who attribute in order to generate red letters, 
why would you then generate punctuation if your objective is not to do so?
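
A minimal sketch of the red-letter case, assuming a simplified, namespace-free fragment: the renderer checks the who attribute and wraps the speech in a styled span, and never emits punctuation for <q>. The class name `red-letter` and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

SAMPLE = (
    '<verse osisID="John.14.6">Jesus saith unto him, '
    '<q who="Jesus">I am the way, the truth, and the life.</q></verse>'
)


def render_children(node):
    """Render the content of node (text and child elements) as HTML."""
    out = [escape(node.text or "")]
    for child in node:
        out.append(render_element(child))
        out.append(escape(child.tail or ""))
    return "".join(out)


def render_element(elem):
    """Render one element; <q who="Jesus"> becomes a styled span, no quote marks."""
    inner = render_children(elem)
    if elem.tag == "q" and elem.get("who") == "Jesus":
        return '<span class="red-letter">' + inner + "</span>"
    return inner  # every other element contributes only its content


if __name__ == "__main__":
    print(render_children(ET.fromstring(SAMPLE)))
```

Whether punctuation appears is decided entirely by the stylesheet or renderer, which is the point being made above.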

>    6.
>       OSIS takes the control of quotation punctuation out of the hands
>       of the translators and gives it to the programmers who write the
>       programs that interpret the OSIS.

*False* If the encoder wants to do that, I suppose they can rely on 
defaults. If they want to be explicit, they can use the n attribute on q 
elements. In the future, they'll probably be able to explicitly define 
document default quotation punctuation.
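
A sketch of how a reader might honor explicit marks, under loudly stated assumptions: each <q> milestone carries its literal punctuation in n, an empty n means "print nothing", and a missing n falls back to renderer defaults. The `DEFAULTS` table and this fallback policy are hypothetical, not something the OSIS manual is being quoted on.

```python
import xml.etree.ElementTree as ET

# Hypothetical renderer defaults, used only when a milestone has no n attribute.
DEFAULTS = {"start": "\u201c", "end": "\u201d"}

SAMPLE = (
    '<p>He said, <q sID="q1" n="&#171;"/>Come<q eID="q1" n="&#187;"/> '
    'and then <q sID="q2"/>Go<q eID="q2"/>.</p>'
)


def mark_for(q):
    """Pick the punctuation for one <q> milestone: an explicit n wins."""
    explicit = q.get("n")
    if explicit is not None:
        return explicit  # may be "" to suppress punctuation entirely
    return DEFAULTS["start"] if q.get("sID") else DEFAULTS["end"]


def flatten(elem):
    """Return the text of elem with <q> milestones replaced by their marks."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "q":
            parts.append(mark_for(child))  # milestones are empty elements
        else:
            parts.append(flatten(child))
        parts.append(child.tail or "")
    return "".join(parts)


if __name__ == "__main__":
    print(flatten(ET.fromstring(SAMPLE)))
```

With this arrangement the encoder, not the programmer, decides the marks wherever n is present.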

>    7.
>       OSIS does not support footnote range start tags for easy hyperlink
>       generation.

*False* Once again, you've explicitly been told how to do this.

>    8.
>       Handling of minor variations in versification is awkward in OSIS.
>       Older attempts at documenting OSIS made a stab at handling this,
>       but currently published documentation doesn't even address this issue.

*False* Versification variations are handled by assigning different 
workIDs in osisIDs.
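
As a rough illustration, assuming identifiers of the form Work:Book.Chapter.Verse and an attribute value that lists several of them separated by whitespace, a reader could sort references by versification like this. The work names `KJV` and `MT`, and the default work `Bible`, are placeholders, and how works are declared in the header is not shown.

```python
def split_osis_id(osis_id, default_work="Bible"):
    """Split 'Work:Book.Chapter.Verse' into (work, reference)."""
    work, sep, ref = osis_id.partition(":")
    if not sep:
        # No prefix: assume the document's default work.
        return default_work, osis_id
    return work, ref


def index_by_work(osis_id_attr):
    """Map each work prefix to the references listed in one attribute value."""
    index = {}
    for token in osis_id_attr.split():
        work, ref = split_osis_id(token)
        index.setdefault(work, []).append(ref)
    return index


if __name__ == "__main__":
    # One verse addressed under two hypothetical versification schemes.
    print(index_by_work("KJV:Ps.51.1 MT:Ps.51.3"))
```

A generator for a particular edition would then pick the references belonging to the work it cares about and ignore the rest.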

>    9.
>       OSIS parsing is unnecessarily complex mostly due to the fact that
>       it does not handle the overlapping of book/chapter/verse,
>       quotations, and book/section/paragraph or stanza/verse/line
>       hierarchies of Scripture texts well. It really has multiple ways
>       of handling these, and OSIS readers have to deal with all of them,
>       adding unnecessary complexity.

*Not really* The hierarchies are set up so that easy stuff is easy (to 
encode and parse) and so that hard stuff is possible. But 
book/section/paragraph is the best practice. Chapter/verse should be 
handled via milestones. And accordingly stanza/line will probably end up 
being milestones. In easy cases, you can probably get away with more 
containers.

>   10.
>       Start/end tag matching identifiers are used where they really
>       wouldn't be required, and add unnecessary complexity to OSIS
>       generation. This isn't a big deal for program-generated OSIS, but
>       it is probably enough all by itself to push the complexity past
>       what most OWLs can handle error-free for manual OSIS generation
>       with a text editor.

The matching identifiers are necessary to overcome the single-hierarchy 
requirement. It is simple for programmatically-generated OSIS, and I 
wouldn't want to type out full OSIS docs in a text editor. It's 
certainly possible with simple texts, but complex texts are better 
suited to an XML or OSIS-specific editor or word processor macros.
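
To show why the matching identifiers matter, here is a small sketch in which a verse crosses a paragraph boundary. The sID/eID pairs let a reader reassemble the verse even though no single element contains it. The fragment is simplified (no namespace), it assumes verses are encoded only as milestones, and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div>
<p><verse sID="Gen.1.1" osisID="Gen.1.1"/>First verse text. <verse eID="Gen.1.1"/><verse sID="Gen.1.2" osisID="Gen.1.2"/>Second verse starts</p>
<p>and ends in a new paragraph.<verse eID="Gen.1.2"/></p>
</div>"""


def verses_from_milestones(xml_text):
    """Collect the text between each verse sID and its matching eID."""
    root = ET.fromstring(xml_text)
    verses = {}
    open_ids = []  # sIDs seen but not yet closed

    def add(text):
        if text:
            for vid in open_ids:
                verses[vid] += text

    def walk(elem):
        if elem.tag == "verse":
            if elem.get("sID"):
                open_ids.append(elem.get("sID"))
                verses[elem.get("sID")] = ""
            elif elem.get("eID"):
                # Raises ValueError if the eID has no matching open sID.
                open_ids.remove(elem.get("eID"))
        else:
            add(elem.text)
        for child in elem:
            walk(child)
            add(child.tail)

    walk(root)
    # Collapse the whitespace left over from paragraph breaks.
    return {vid: " ".join(text.split()) for vid, text in verses.items()}


if __name__ == "__main__":
    for vid, text in verses_from_milestones(SAMPLE).items():
        print(vid, "->", text)
```

Without the identifiers, a reader would have to guess which end milestone closes which start, and that guess fails as soon as two milestoned structures overlap.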

>   11.
>       There is a fair amount of ambiguity in the OSIS standard, leading
>       to doubts about reliable compatibility between different software
>       products using OSIS to interchange data.

*Fear! Uncertainty! Doubt!* I suspect ambiguity is inevitable (in OSIS 
or USFM).

>   12.
>       The current OSIS standard is not easy to find on the OSIS web
>       site, and the documentation that is there is downlevel.

Probably this is something you should point out to the maintainers of 
the site. It's easy enough to fix.

>   13.
>       OSIS has inadequate software support for drafting, checking, and
>       publishing Scriptures.

*True*--more software would be nice. As with most new technologies, new 
support infrastructure is necessary.

>   14.
>       I have yet to see reliable converters between OSIS and USFM. (I
>       have written an OSIS writer myself, but it was impossible to
>       complete without “cheating” on the OSIS standard a little, making
>       modifications that the OSIS committee seems to be unwilling to make.)

Weren't these points 3 & 4? USFM->OSIS is easy. OSIS->USFM should be 
possible to the extent that USFM supports OSIS's level of markup.

>   15.
>       The unnecessary complexity of OSIS means that software written to
>       read and write will be more expensive, take longer to write, and
>       probably contain more bugs than software written to a simpler
>       standard, even though a simpler standard could do anything OSIS
>       could do.

*False* Oh... Okay... So, when writing software that uses an XML 
standard, it will take much more time and expense than if one were to 
use USFM? Are you kidding me? With OSIS you can use an off-the-shelf XML 
parser.

>   16.
>       The OSIS schema I used to program to when testing its suitability
>       could not handle simple things like supplied text (KJV italics)
>       within a Psalm title.

Now you're getting petty. That's not exactly a difficult thing to 
identify to the TC and get changed in the schema. But I understand... 
you're searching hard for complaints to make so that you can sell a new 
solution to a non-existent problem.

>   17.
>       OSIS is too complex to embed in WordML along with working typeset
>       text.

I don't know what would make OSIS "too complex" for WordML. It's 
certainly transformable to WordML, if you really want to use Word for 
typesetting.

>   18.
>       OSIS could be made usable with some minor modifications, but there
>       is no indication that those modifications would ever be made.

The OSIS TC is open to suggestions, which are evaluated and generally 
included if they're appropriate.

>   19.
>       OSIS could never be made simple enough to be elegant and to save
>       on software development costs without sacrificing backward
>       compatibility. To really fix it, it would be better to replace it
>       and provide a conversion tool for legacy text. This, in turn,
>       raises doubts about OSIS' suitability as an archival format.

*Nonsensical*

>   20.
>       In an environment where there has been a large perceived need for
>       an XML Scripture file interchange standard, OSIS has been around
>       for a very long time (in Internet years) without producing a
>       significant following among software developers or Bible
>       translators. There are a couple of notable exceptions (like The
>       Sword Project), but even then, I think that significantly slowed
>       development on that project.

*False* OSIS is mostly used behind the scenes, so you're not especially 
likely to find out about its use except in The SWORD Project and JSword. 
There's some use in the translation field. Converters have been written 
at at least one commercial Bible software publisher. And it hasn't 
slowed development here at the SWORD Project. If anything, it has sped 
it because we don't have to care about inferior formats like GBF and 
ThML anymore, so our energies are focused on developing things once 
rather than repeatedly (for each format).

>   21.
>       The mere thought that OSIS would be useful to us in the field with
>       the current set of support tools is laughable due to the overly
>       complex nature of that schema. OSIS is too complex for competent
>       programmers to fully grasp, let alone my typesetting staff.
>       Defining a “best practices” subset of OSIS is not sufficient to
>       fix this problem.

Just because you can't fully grasp OSIS doesn't mean that we competent 
programmers can't. For the subset of OSIS necessary for encoding basic 
Bibles, OSIS is not especially complex.

>   22.
>       I find some of the tools provided so far for OSIS editing to be
>       intimidating from a security and usability standpoint. For
>       example, I'm not willing to even test the OSIS editor Word 2003
>       plugin on a production machine because of the way it uses macros.

??? You're afraid of macros? Didn't you cite macros as a selling point 
of USFM above? Did you know that executables can also contain malicious 
code?

>   23.
>       Given all of the above, I consider OSIS to be dangerous, in that
>       it is consuming resources better applied elsewhere and
>       discouraging people from looking at alternatives.

*Fear! Uncertainty! Doubt!* Lock up the children! OSIS is on the loose!!!!

>     What I Like About USFX
>     What I Don't Like About USFX

I'll just skip this since no one cares about USFX.

--Chris