[osis-users] Unambiguous and Consistent OSIS for Interchange: Stand-off Markup

Sun Jan 24 01:37:45 MST 2010

Attached is an example of what an ESV web service API response for
1 John 5:7-8 could look like, including virtual elements and stand-off
markup. The relevant fragment:

<concurrent>
    <!--
    @virtual can be "start", "end", "both", or "none" (the default).
    The target attribute is the one used by Open Siddur; Efraim Feinstein
    notes that range() is a TEI-defined XPointer scheme:
    http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATS
    An alternative would be to use @sID and @eID, as shown in the comment
    after each element.
    -->
    <p virtual="both" target="#range(w6200500701, p6200500812)" />
    <!-- sID="w6200500701" eID="p6200500812" -->
    <verse osisID="1John.5.7" target="#range(w6200500701, p6200500706)" />
    <!-- sID="w6200500701" eID="p6200500706" -->
    <verse osisID="1John.5.8" target="#range(w6200500801, p6200500812)" />
    <!-- sID="w6200500801" eID="p6200500812" -->
</concurrent>
<content><!-- isn't @scope="1John.5.7-1John.5.8" redundant here? -->
    <title ID="h6200500601" canonical="false" virtual="true">Testimony Concerning the Son of God</title>
    <w ID="w6200500701">For</w>
    <w ID="w6200500702">there</w>
    <w ID="w6200500703">are</w>
    <w ID="w6200500704">three</w>
    <w ID="w6200500705">that</w>
    <w ID="w6200500706">testify</w><w ID="p6200500706">:</w>
    <w ID="w6200500801">the</w>
    <w ID="w6200500802">Spirit</w>
    <w ID="w6200500803">and</w>
    <w ID="w6200500804">the</w>
    <w ID="w6200500805">water</w>
    <w ID="w6200500806">and</w>
    <w ID="w6200500807">the</w>
    <w ID="w6200500808">blood</w><w ID="p6200500808">;</w>
    <w ID="w6200500809">and</w>
    <w ID="w6200500810">these</w>
    <w ID="w6200500811">three</w>
    <w ID="w6200500812">agree</w><w ID="p6200500812">.</w>
</content>

On Thu, Jan 21, 2010 at 9:40 AM, Weston Ruter <westonruter at gmail.com> wrote:

> Troy:
>
>> I did say that since OSIS allows different ways to mark the same structure,
>> we have an importer which attempts to accept any valid OSIS doc and
>> _normalizes_ that doc into a form of OSIS we find easiest for our engine to
>> process.  It is still OSIS, just a form of OSIS with all structures
>> represented in a single way.
>>
>
> Thank you for clarifying this, and also for sharing some of this history
> behind the development of OSIS.
>
>> [We chose to] augment the specification with a 'best practices' doc which
>> recommends a single specific method for encoding OSIS.
>>
>
> I don't think I have seen this best practices doc. Is this something you
> use internally at CrossWire as part of your importer script? Could you
> direct me to it? I like the approach you took, allowing varying OSIS
> encodings but recommending only one of them. This is similar to the
> development of XHTML 1.0 dialects, where you are allowed to use the
> Transitional doctype, but the Strict doctype is recommended. Doing this for
> OSIS could answer the need for an unambiguous single markup language. The
> best practices document would need to contain the practices that are
> endorsed by at least the majority of players; the others could abstain and
> still use their preferred (deprecated) dialect of OSIS. Along with this best
> practices doc, an official normalizer script that translates OSIS into the
> recommended encoding would be great.
>
> I look forward to your thoughts about stand-off markup encoding of OSIS,
> especially how well it might serve as the new recommended way to
> unambiguously encode OSIS.
>
> Thanks!
> Weston
>
>
> 2010/1/19 Troy A. Griffitts <scribe at crosswire.org>
>
>> Weston Ruter wrote:
>>
>>> ... Troy, as you've said before, you can't actually use OSIS as your raw
>>> data format at CrossWire because an OSIS document can be authored in many
>>> different ways and so there is much more programming logic that is needed to
>>> handle all of the possible OSIS styles.
>>>
>>
>> Hey Weston,
>>
>> Hope to have time for a thoughtful response to more of your suggestions,
>> but just wanted to clear a couple things up first:
>>
>> I hope I never implied that we can't/don't use OSIS internally as our
>> primary markup standard.
>>
>> I did say that since OSIS allows different ways to mark the same
>> structure, we have an importer which attempts to accept any valid OSIS doc
>> and _normalizes_ that doc into a form of OSIS we find easiest for our engine
>> to process.  It is still OSIS, just a form of OSIS with all structures
>> represented in a single way.
>>
>> Even so, we still don't use any plain text format as our "raw data
>> format".  We typically compress and index documents when they are imported
>> into our engine.  You can ask our engine for OSIS, HTML, RTF, GBF, ThML, or
>> plaintext and it will do its best to give you the data in the requested
>> format.
>>
>> None of this is to argue against your point: OSIS has multiple ways to
>> encode a single structure in a document.
>>
>> The real answer to this is not technical.  I too am frustrated with this.
>> But many people working at many organizations were consulted when
>> developing the OSIS specification.  They gave great insights into how they
>> work.  Sometimes they even made demands with an ultimatum that they would
>> absolutely not use the specification if a certain feature was not added to
>> the spec.
>>
>> OSIS could have been technically finished in less than a year.  It took us
>> 3 years to get buy-in from all the participating organizations.
>>
>> In the end, the purpose of OSIS was to build collaboration between
>> organizations.  We could either have developed a much easier-to-use
>> technical specification which no one would have used, or conceded to
>> demands to gain buy-in and augmented the specification with a 'best
>> practices' doc which recommends a single specific method for encoding
>> OSIS.  We chose the latter.
>>
>> Implementing code against the spec now makes our importer a pain in
>> the butt to write, but in the end, we get what we want: a single OSIS style
>> that our engine knows how to work with, and multiple supporting
>> organizations producing OSIS documents.
>>
>>
>> Troy.
>>
>>
>>
>>
>>> If we could define a single document structure, however, one
>>> that is a subset of the freedom that OSIS provides (perhaps taking cues
>>> from OXES), we could then have an XML format for scripture that would be
>>> suited for efficient interchange and application traversal.
>>>
>>> Currently we have the problem of two overlapping hierarchies: BSP
>>> (book, section, paragraph) and BCV (book, chapter, verse). However, there
>>> could be multiple versification systems, so there could be more than two
>>> overlapping hierarchies, which is probably why the <p> element isn't
>>> currently milestonable. To get around the problem of overlapping
>>> hierarchies, what if we introduced stand-off markup into the equation?
>>> The words of scripture themselves could all be located in a flat
>>> structure as siblings; then in the header there could be multiple CONCUR
>>> sections (views) that list the elements which belong to the various
>>> parts of the hierarchies.
>>>
>>> For example, the current approach:
>>>
>>> <p>
>>>    <verse osisID="Example.1.1" sID="Example.1.1" />
>>>    <w id="w1">Then</w>
>>>    <w id="w2">he</w>
>>>    <w id="w3">said</w><w id="p1">,</w>
>>>    <q marker="“" sID="Example.1.1.q1" />
>>>        <w id="w4">Let</w>
>>>        <w id="w5">us</w>
>>>        <w id="w6">go</w><w id="p2">...</w>
>>> </p>
>>> <p>
>>>    <w id="w7">but</w>
>>>    <verse eID="Example.1.1" />
>>>    <verse osisID="Example.1.2" sID="Example.1.2"/>
>>>    <w id="w8">don't</w>
>>>    <w id="w9">forget</w>
>>>    <w id="w10">your</w>
>>>    <w id="w11">backpack</w><w id="p3">.</w>
>>>    <q marker="”" eID="Example.1.1.q1" />
>>>    <verse eID="Example.1.2" />
>>> </p>
>>>
>>>
>>>
>>> Could instead appear as (I'm making up these element names):
>>>
>>> <concur>
>>>    <view type="verse" osisID="Example.1.1" xpointer="range(#w1, #w7)" />
>>>    <view type="verse" osisID="Example.1.2" xpointer="range(#w8, #q2)" />
>>>    <view type="quote" xpointer="range(#q1, #q2)" />
>>>    <view type="para"  xpointer="range(#w1, #p2)" />
>>>    <view type="para"  xpointer="range(#w7, #q2)" />
>>> </concur>
>>> <content>
>>>    <w id="w1">Then</w>
>>>    <w id="w2">he</w>
>>>    <w id="w3">said</w><w id="p1">,</w>
>>>    <w id="q1">“</w><w id="w4">Let</w>
>>>    <w id="w5">us</w>
>>>    <w id="w6">go</w><w id="p2">...</w>
>>>    <w id="w7">but</w>
>>>    <w id="w8">don't</w>
>>>    <w id="w9">forget</w>
>>>    <w id="w10">your</w>
>>>    <w id="w11">backpack</w><w id="p3">.</w><w id="q2">”</w>
>>> </content>
>>>
>>> By structuring a document like this, multiple overlapping hierarchies can
>>> be cleanly defined, although they are separated from the underlying
>>> content. That separation, however, clears up the confusion as to where
>>> the <verse>, <p>, and <q> elements should be placed: in the concur section
>>> they can each share references to the same content elements, so their
>>> boundaries are specified at exactly the same location. This means that XML
>>> processors would be able to handle each of the hierarchies consistently as
>>> they interweave throughout the content data.
>>>
>>> Efraim Feinstein and James Tauber introduced me to this approach to
>>> structuring markup. See also:
>>> http://www.tei-c.org/Guidelines/P4/html/NH.html#NHCO
>>>
>>> Weston
>>>
>>>
>>
>
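
For the milestone example quoted above (the <p> containers with <verse> and
<q> sID/eID milestones), the stand-off ranges can in principle be derived
mechanically. The following Python sketch is only an illustration of that
idea, not a description of any existing importer: the <div> wrapper, the
function name, and the tuple layout are my own assumptions. It walks the
document in order and records, for each structure, the first and last <w> it
covers.

```python
import xml.etree.ElementTree as ET

# The quoted milestone example, wrapped in a hypothetical <div> root.
MILESTONED = """
<div>
  <p>
    <verse osisID="Example.1.1" sID="Example.1.1"/>
    <w id="w1">Then</w>
    <w id="w2">he</w>
    <w id="w3">said</w><w id="p1">,</w>
    <q marker="“" sID="Example.1.1.q1"/>
    <w id="w4">Let</w>
    <w id="w5">us</w>
    <w id="w6">go</w><w id="p2">...</w>
  </p>
  <p>
    <w id="w7">but</w>
    <verse eID="Example.1.1"/>
    <verse osisID="Example.1.2" sID="Example.1.2"/>
    <w id="w8">don't</w>
    <w id="w9">forget</w>
    <w id="w10">your</w>
    <w id="w11">backpack</w><w id="p3">.</w>
    <q marker="”" eID="Example.1.1.q1"/>
    <verse eID="Example.1.2"/>
  </p>
</div>
"""


def milestones_to_views(root):
    """Return (type, name, first_id, last_id) tuples; name is None for <p>."""
    views = []
    open_spans = {}  # sID -> [tag, id of first covered token or None]
    last_id = None   # id of the most recent <w> seen in document order
    for el in root.iter():
        if el.tag == "w":
            last_id = el.get("id")
            # Any milestone opened since the previous token starts here.
            for span in open_spans.values():
                if span[1] is None:
                    span[1] = last_id
        elif el.get("sID") is not None:
            open_spans[el.get("sID")] = [el.tag, None]
        elif el.get("eID") is not None:
            tag, first_id = open_spans.pop(el.get("eID"))
            views.append((tag, el.get("eID"), first_id, last_id))
        elif el.tag == "p":
            # Container element: its range is its first and last child <w>.
            words = el.findall("w")
            views.append(("p", None, words[0].get("id"), words[-1].get("id")))
    return views


views = milestones_to_views(ET.fromstring(MILESTONED))
for view in views:
    print(view)
```

The ranges differ slightly from the hand-written <concur> example because
here the quotation marks live in the marker attribute rather than in their
own <w> tokens, so the quote and the second verse end at p3 instead of q2.
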
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 1John.5.7-1John.5.8.ESV.xml
Type: text/xml
Size: 2614 bytes
Desc: not available
URL: <http://www.crosswire.org/pipermail/osis-users/attachments/20100124/005b70b6/attachment-0001.xml>