[sword-devel] Food for thought regarding OSIS and some of its alternatives...

Wed Feb 8 07:11:21 MST 2006

Hello, Chris!

First, please let me apologize. By the personal attacks in your message,
you may have gotten the impression that I was attacking you personally
instead of promoting the idea that OSIS is not appropriate to set forth
as a standard format for use in Bible translation and Scripture file
archiving. I don't think the latter would be a good thing, for the
reasons I set forth. It is apparent that your knowledge of the current
state of OSIS vastly exceeds mine, since my opinion of OSIS is based on
the last openly published documentation I was able to retrieve from the
OSIS web site. There are apparently some web publishing problems, as
well as a significant lag in time between OSIS committee decisions and
implementation of those in the documentation. Please forgive me if you
think it unfair that I judge OSIS based on what I can see of OSIS on the
web.

Chris Little wrote:
>
> Kahunapule Michael Johnson wrote:
>
>>     XML Myths Debunked
>>
>> Myth 1: Anything in XML is inherently better for archiving and
>> processing than non-XML formats. *False.* XML is just a set of rules
>> defining how text files can define data, with tags, attributes, and
>> contents being easily separated and parsed. One disadvantage of XML
>> is that it forces strict nesting of elements, making it an awkward
>> basis for Biblical texts. (This shortcoming is easy to overcome using
>> milestones, which are empty elements that mark the beginning or end
>> of something. Unfortunately, there are some ways of doing that which
>> are error-prone and not elegant, like OSIS does.)
>
> Sorry? Did you have a better suggestion?
Yes: a better XML schema for some applications, and sticking with USFM
for other applications.

>> Myth 2: XML is better than SF for processing because of the software
>> tools available for processing XML. *False.* There are some good
>> tools available for processing XML and transforming it to other
>> formats, but there are also pretty simple SF parsers available, too.
>> Implementing the latter is actually simpler than the former.
>
> You are being dishonest. SF has inherent structural ambiguities. USFM
> fixes these for the most part, so someone with a process around (U)SFM
> should probably keep that until they are ready to transition to
> OSIS--at which time they can use the USFM to OSIS converters for
> fairly painless transition.
I do not intend to be dishonest or to deceive anyone. I have nothing to
gain by doing so. USFM does fix most of the structural ambiguities (and
will soon fix the only remaining one that I'm aware of).
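
To make the "pretty simple SF parsers" claim concrete, here is a minimal
sketch in Python. It is only an illustration, not the reader/writer
component mentioned later in this message. It assumes the usual
backslash-marker convention (a marker such as \id, \c, \v, or \p followed
by its text) and treats every marker as the start of a new field; end
markers, footnotes, and character styles are not handled. The name
parse_sf is made up for this example.

```python
import re

# A marker is a backslash followed by letters and optional digits,
# e.g. \id, \c, \v, \p, \q1. (Assumed convention for this sketch.)
MARKER = re.compile(r"\\([a-z]+[0-9]*)\s*")

def parse_sf(text):
    """Split standard-format text into (marker, content) pairs.

    Hypothetical helper: every backslash marker starts a new pair, and the
    content runs up to the next marker. Nothing else is interpreted.
    """
    pairs = []
    matches = list(MARKER.finditer(text))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        content = " ".join(text[m.end():end].split())  # normalize whitespace
        pairs.append((m.group(1), content))
    return pairs

sample = r"""\id GEN
\c 1
\p
\v 1 In the beginning God created the heavens and the earth.
\v 2 And the earth was without form, and void.
"""

for marker, content in parse_sf(sample):
    print(marker, "|", content)
```

A real reader would still need a table of known markers and their
nesting rules, which is where most of the remaining work lies.
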
>> Myth 3: If data is expressed in XML, it can easily be transformed to
>> other formats. *False.* The data can only be transformed to other
>> formats if all required information for the target format is present
>> in the source format, and segregated with the same granularity.
>> Furthermore, the programming skills necessary to perform these
>> transformations are specialized knowledge that it is not reasonable
>> to expect the average computer user to be fluent in. (The “average”
>> computer user is probably challenged to understand a tree-structured
>> directory file system, let alone XSLT.)
>
> Your statement that this is "False" is both incorrect and
> intentionally misleading. The fact that you can't magically invent
> extra data that never existed via an XSLT doesn't mean that data in
> XML cannot easily be transformed using the data that DOES exist.
> Easiness is obviously relative. I expect that there are many more
> people in the world who are capable of writing an XSLT than a CC
> table. Not to mention, books and copious online material exist to lead
> a person through the former, but not the latter. Thus, if the question
> is whether XML is easier to transform to other formats than
> (U)SFM/GBF, the answer is yes.
That was not the question or the answer. Again, I don't intend to
mislead anyone. I claim that both XSLT and CC tables are beyond most
computer users. The ease of conversion depends heavily on the schema(s)
involved in the XML, the consistency of the data, and the meaning
applied to the markup, much more so than on whether the data is in XML
at all. Maybe *not always true* would be a better statement than *false*
when considering competent programmers instead of average users.

>
>>    2.
>>       It is well documented, and the documentation is maintained and
>>       published in accessible formats (HTML and PDF in a way that is
>>       easy to mirror on a notebook computer taken to a remote village).
>>       The latest documentation is easy to find and clearly labeled with
>>       its revision date.
>
> *True* The same goes, more or less, for OSIS. I know the manual is
> still in progress, but a new version was posted a few months ago.
> Basic "commonly-used features" have certainly been covered for a while.
Much less than more. Your statements lead me to believe that what I
found online recently must be way out of date. Either that or we simply
draw much different conclusions from the same observations.
>>    3.
>>       The maintainers of USFM are responsive to comments and mindful of
>>       backward compatibility issues when they make changes.
>
> *LOL* Awww. I feel like someone may still have his panties in a bunch
> even after he was given a solution for encoding presentation form of
> quotation marks in OSIS.
Is this the way a member of the OSIS committee regards serious input
from someone representing the concerns of many potential OSIS users?
>>   10.
>>       In the unlikely event that USFM would be inadequate for a
>>       particular language or translation, it would not be difficult to
>>       extend it for whatever unusual circumstances might come up.
>
> *True*, but this is only good if the maintaining body does the
> extension. Same is true of OSIS.
So, I should pick a standard ruled by a maintaining body that has a good
track record for responsiveness and respectful treatment of those who
use the standard. Right?
>>   12.
>>       USFM is simple enough to program for that it can be used with low
>>       power computing devices.
>
> *Huh?* I don't see the relation between power and programming
> complexity. Maybe this is a criticism of memory intensiveness of the
> DOM; in which case, don't use DOM.
It isn't just DOM. Such devices have serious memory and processing power
limits compared to what you are used to on a normal PC.
>
>>     What I Like About OSIS
>>
>>    3.
>>       USFM data can be converted to OSIS automatically if you accept
>>       some modifications to the OSIS documented standard, and if you
>>       don't mind adding some metadata from other sources. It is a little
>>       awkward, and may involve loss of some metadata, but it is possible.
>
> In my experience, USFM->OSIS is a lossless conversion. I could be
> wrong here, but I don't remember losing any data, no matter how
> minute, in the conversions I've done. And those were all automatable.
In your experience, you were encoding different data and almost
certainly a different OSIS variant than what I encountered in my experience.
> Your mention of quotation marks is just a red herring.
If you really believe that, and if you have significant say in the
maintenance of OSIS, then you are doing more damage to the acceptance of
OSIS than anyone else.
> You've had the use of the n attribute on q elements explained to you.
> You could have disseminated that knowledge further, but instead chose
> to ignore it and lie about a non-existent deficiency in OSIS.
I did not choose to lie. The deficiency is real in the latest official
OSIS documentation that I could find on the web when I last looked (last
week). The n attribute was explained to me as a potential, probable
future change to OSIS, not a done deal.
> And, as Troy mentions, we will probably make quotation marking even
> simpler via a defaulting mechanism definable in the header.
Probably?
>>    2.
>>       The latest documentation I read on OSIS indicated that it was
>>       improper to put quotation punctuation directly in the text,
>>       instead requiring it to be converted to markup-- a process that is
>>       difficult, if not impossible to do automatically, especially
>>       without detailed language-specific information.
>
> *Mostly False* The OSIS Manual (2.1 draft, Appendix K, Conformance
> Requirements) doesn't actually specify anything about marking
> quotations with <q> vs. ". If it does, in a later version, I would
> expect to see that conformance requirement appearing at either level 2
> or 3. (I would say it is at least implicit in the level 3
> requirement.) So, while it may be arguably "improper to put quotation
> punctuation directly in the text", it is certainly not a conformance
> requirement, meaning that documents which fail to do so are poor OSIS,
> but OSIS nonetheless.
My comments are true based on the latest OSIS manual I read.
>>    3.
>>       OSIS Scripture files are not self-contained with respect to all of
>>       the meaning-based markup of the text, unlike USFM, except in some
>>       simple cases.
>
> *False* There are plans to permit including portions of external
> documents in an OSIS document, but that isn't really a top priority at
> the moment.
Again, your perception of reality differs from mine. Maybe it is a
documentation problem. Maybe we look at the same thing and draw
different conclusions.
>>    5.
>>       OSIS has no mechanism for encoding “red letter” editions of Bibles
>>       other than <q> tags, and those could be interpreted by OSIS
>>       readers to mean that punctuation should be inserted, even if the
>>       target language and style forbids such insertion.
>
> *Grossly Misleading* This is actually explicitly addressed in the
> manual. Go read it.
I did. My statement stands, based on the currently published manual.
I'll revise it when I see the corrected manual publicly posted.
> This complaint is as valid as complaining that OSIS does nothing to
> ensure that completely daft XSLT writers don't convert <q who="Jesus">
> to HTML <blink>. We can't prevent you from writing bad stylesheets.
No, but you can clearly document what a good stylesheet would or would
not do.
> If you're checking for the who attribute in order to generate red
> letters, why would you then generate punctuation if your objective is
> to not do so.
Perhaps you don't intend OSIS to be a reliable Scripture file
interchange standard: one in which one person writes a file and another
person reads it, both with reference to the OSIS documentation, but
without exchanging additional information like intention to generate
punctuation or not that is not explicitly and clearly encoded in the
OSIS document. If that is truly the case, then perhaps my complaints are
all irrelevant, and so is the OSIS standard for all of the applications
that I might possibly consider using it for.
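
The interchange problem can be shown with a small Python sketch. Only
the <q who="Jesus"> element comes from this thread; the surrounding
fragment, the render function, and its insert_quotes flag are
hypothetical, and neither behavior is claimed to be what the OSIS
documentation specifies. The point is that two readers can both honor
the who attribute and still disagree about punctuation.

```python
import xml.etree.ElementTree as ET

# Simplified fragment: only <q who="Jesus"> is taken from the discussion.
fragment = '<p>He said, <q who="Jesus">Follow me.</q></p>'

def render(xml_text, insert_quotes):
    """Flatten a fragment to HTML-like text, marking Jesus' words in red.

    insert_quotes is the reader's own choice; nothing in the fragment
    tells the reader which behavior the encoder expected.
    """
    root = ET.fromstring(xml_text)
    out = [root.text or ""]
    for child in root:
        words = child.text or ""
        if child.tag == "q":
            if insert_quotes:
                words = "\u201c" + words + "\u201d"  # curly double quotes
            if child.get("who") == "Jesus":
                words = '<span class="red">' + words + "</span>"
        out.append(words)
        out.append(child.tail or "")
    return "".join(out)

reader_a = render(fragment, insert_quotes=True)   # generates punctuation
reader_b = render(fragment, insert_quotes=False)  # red letters only
print(reader_a)
print(reader_b)
```

Both outputs are defensible readings of the same file, which is exactly
why the expected behavior needs to be encoded in the document or pinned
down in the documentation.
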
>>    6.
>>       OSIS takes the control of quotation punctuation out of the hands
>>       of the translators and gives it to the programmers who write the
>>       programs that interpret the OSIS.
>
> *False* If the encoder wants to do that, I suppose they can rely on
> defaults. If they want to be explicit, they can use the n attribute on
> q elements. In the future, they'll probably be able to explicitly
> define document default quotation punctuation.
Again, I call it like I read the currently published OSIS documentation.
You obviously are either interpreting things differently than I do
(meaning the documentation is too ambiguous) or looking at a different
set of documentation than I am (meaning there is probably a problem with
publishing the documentation).
>
>>    7.
>>       OSIS does not support footnote range start tags for easy hyperlink
>>       generation.
>
> *False* Once again, you've explicitly been told how to do this.
Not in the currently published OSIS documentation.
>>    8.
>>       Handling of minor variations in versification is awkward in OSIS.
>>       Older attempts at documenting OSIS made a stab at handling this,
>>       but currently published documentation doesn't even address this
>>       issue.
>
> *False* Versification variations are handled by assigning different
> workIDs in osisIDs.
Again, we must be working from different documentation and different
dialects of OSIS.
>
>>    9.
>>       OSIS parsing is unnecessarily complex mostly due to the fact that
>>       it does not handle the overlapping of book/chapter/verse,
>>       quotations, and book/section/paragraph or stanza/verse/line
>>       hierarchies of Scripture texts well. It really has multiple ways
>>       of handling these, and OSIS readers have to deal with all of them,
>>       adding unnecessary complexity.
>
> *Not really* The hierarchies are set up so that easy stuff is easy (to
> encode and parse) and so that hard stuff is possible. But
> book/section/paragraph is the best practice. Chapter/verse should be
> handled via milestones. And accordingly stanza/line will probably end
> up being milestones. In easy cases, you can probably get away with
> more containers.
Actually, stanza/line goes neatly in the book/section/paragraph
hierarchy in place of paragraph -- pick any number of stanzas or
paragraphs, in any order inside of a section. Stanzas are just poetry
paragraphs as opposed to prose paragraphs.
>>   10.
>>       Start/end tag matching identifiers are used where they really
>>       wouldn't be required, and add unnecessary complexity to OSIS
>>       generation. This isn't a big deal for program-generated OSIS, but
>>       it is probably enough all by itself to push the complexity past
>>       what most OWLs can handle error-free for manual OSIS generation
>>       with a text editor.
>
> The matching identifiers are necessary to overcome the
> single-hierarchy requirement.
Actually, they are not. There are several other ways to do this, some of
which are simpler and more elegant.
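
One possible alternative, sketched in Python under stated assumptions:
mark only where each verse starts and let the next start marker end the
previous verse, much as \v does in USFM. The <v n="..."/> element here
is hypothetical markup invented for this example; it is not OSIS, and
the function name verses is likewise made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <v n="..."/> is an empty element marking where a
# verse starts. Verse 1 runs into a quotation; verse 2 starts inside it.
fragment = (
    '<p><v n="1"/>First verse, which runs on '
    '<q>into a quotation. <v n="2"/>Second verse, still quoted.</q>'
    ' After the quotation.</p>'
)

def verses(xml_text):
    """Return {verse number: text}, ending each verse at the next marker."""
    result = {}
    current = None

    def add(text):
        # Text before the first verse marker is ignored in this sketch.
        if current is not None and text:
            result[current] = result.get(current, "") + text

    def walk(elem):
        nonlocal current
        if elem.tag == "v":
            current = elem.get("n")  # a new start implicitly ends the old verse
        else:
            add(elem.text)
            for child in elem:
                walk(child)
                add(child.tail)

    walk(ET.fromstring(xml_text))
    return result

for number, text in verses(fragment).items():
    print(number, "|", text)
```

No identifiers need to be matched up, and the verses still overlap the
quotation freely. The trade-off is that a verse cannot end before the
next one begins unless an explicit end marker is added for that case.
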
> It is simple for programmatically-generated OSIS, and I wouldn't want
> to type out full OSIS docs in a text editor. It's certainly possible
> with simple texts, but complex texts are better suited to an XML or
> OSIS-specific editor or word processor macros.
Agreed. USFM is simple enough to type in a text editor. OSIS is possible
to do that way, but it would be most unpleasant and error-prone. A
specialized OSIS editor would be much better.
>>   12.
>>       The current OSIS standard is not easy to find on the OSIS web
>>       site, and the documentation that is there is downlevel.
>
> Probably this is something you should point out to the maintainers of
> the site. It's easy enough to fix.
You probably care more, at this point. :-)
>>   15.
>>       The unnecessary complexity of OSIS means that software written to
>>       read and write will be more expensive, take longer to write, and
>>       probably contain more bugs than software written to a simpler
>>       standard, even though a simpler standard could do anything OSIS
>>       could do.
>
> *False* Oh... Okay... So, when writing software that uses an XML
> standard, it will take much more time and expense than if one were to
> use USFM? Are you kidding me? With OSIS you can use an off-the-shelf
> XML parser.
You are entitled to your opinion, of course. As for me, I already have a
reusable, open source USFM reader/writer component written, so it makes
little difference which markup I use. That, however, is not my main
point. The OSIS schema and associated documentation are much more
complex than they have to be, causing the software that reads and writes
OSIS to be more complex, and therefore less reliable and more expensive
than it would be with a simpler XML schema (like USFX) or with USFM.

>>   17.
>>       OSIS is too complex to embed in WordML along with working typeset
>>       text.
>
> I don't know what would make OSIS "too complex" for WordML. It's
> certainly transformable to WordML, if you really want to use Word for
> typesetting.
Trust me. I tried it.
>>   18.
>>       OSIS could be made usable with some minor modifications, but there
>>       is no indication that those modifications would ever be made.
>
> The OSIS TC is open to suggestions, which are evaluated and generally
> included if they're appropriate.
No extra charge for verbal abuse in response to the suggestions, and
don't hold your breath looking for documented updates...

>>   22.
>>       I find some of the tools provided so far for OSIS editing to be
>>       intimidating from a security and usability standpoint. For
>>       example, I'm not willing to even test the OSIS editor Word 2003
>>       plugin on a production machine because of the way it uses macros.
>
> ??? You're afraid of macros? Didn't you cite macros as a selling point
> of USFM above? Did you know that executables can also contain
> malicious code?
I am not afraid of malicious code so much as buggy or poorly designed
code that makes global changes to the way Microsoft Word 2003 behaves on
a production machine used for Scripture typesetting. The documentation for
the OSIS editor Word 2003 plug-in recommends dropping the security level
for macros, which in turn provides an open door for macro viruses from
other sources to waltz right into my computer. If you don't care about
computer security, or if you have extra computers lying around to test
things on, go for it. For me, the benefit (testing another way to
directly edit Scripture files using an XML schema that I probably won't
ever use anyway) did not outweigh the risk of messing up my Scripture
typesetting environment or dropping one line of malware defense. I'm
sure your evaluation of the same situation would vary from mine. :-)

Chris, by all means keep promoting and using OSIS, if you think that is
best. Please pardon any annoyance competition may bring, as it will
either motivate improvements in OSIS or replacement of OSIS with
something better. Either way, most of us win.