[osis-core] RE: Comments on CEV markup samples from a colleague

Mon, 12 May 2003 12:28:48 -0600

See my comments below.

Kees, can you tell me who "Dave" is, and can you forward this to him?

> -----Original Message-----
> From: Kees F. de Blois [mailto:kdeblois@biblesocieties.org]
> Sent: Monday, May 12, 2003 2:45 AM
> To: 'Todd Tillinghast'
> Cc: 'Steve DeRose'; 'Patrick Durusau'; Bob Hodgson
> Subject: Comments on CEV markup samples from a colleague
> 
> Dear Todd,
> 
> I am sharing with you the first comments on your CEV markup samples from a
> member of one of our regional translation/publishing computer task teams,
> Dave van Grootheest of the Netherlands Bible Society. He is raising a
> couple of interesting issues.
> 
> Cheers,
> 
> Kees
> 
> =========================================
> 
> Dear Kees and others,
> 
> I would like to comment on two things:
> (1) The way in which osisIDs and splitIDs are used
> (2) The lack of a direct representation of verse boundaries and verse
> numbers
> 
> (I hope that my comments are not too detailed or too technical.)
> 
> 
> (1) The way in which osisIDs and splitIDs are used
> --------------------------------------------------
> 
> (This is not an entirely new issue - some aspects of it were discussed
> before.)
> 
> It would seem that in the samples, verse content is always embedded in a
> "verse" element.
> That verse element has an osisID attribute that identifies the verse
> involved, e.g. "Gen.48.5".
> 
> If a verse is split across paragraphs or poetry lines, each verse part is
> embedded in its own verse element. Each of these elements has the same
> osisID attribute (identifying the verse involved). Moreover, each of the
> elements has a splitID attribute with a value that is identical to that of
> the osisID attribute. For example:
> 
> <p>
> 	<verse splitID="Gen.48.5" osisID="Gen.48.5">[...]</verse>
> </p>
> <p type="indent">
> 	<verse splitID="Gen.48.5" osisID="Gen.48.5">[...]</verse>
> 	<verse osisID="Gen.48.6">[...]</verse>
> 
> It is not very easy to see the usefulness of the splitID attributes, given
> the presence of the osisID attributes. Why have another attribute with the
> same value?
> 
> If I am correct, the reasoning behind having both was that splitIDs can be
> used in other situations as well, i.e. when splitting something else than a
> verse, in which cases there will not be an osisID with the same value as
> the splitID. In other words, it is a matter of consistency. However, this
> matter of consistency places a fairly heavy burden on the use of the verse
> element: in each case where a verse is split across paragraphs or poetry
> lines, there should be a splitID, with a value that will in fact be equal
> to that of the osisID.

You are correct about the use of other elements that may not have an
osisID but are split.

The idea of simply using the osisID to identify when an element is split
was considered.  
1) The first reason to use (and retain) the splitID, as you pointed out,
is that not all elements may be encoded with an osisID.  There is benefit
in consistency.
2) The second reason is that it is possible for two <verse> elements to
have the same osisID and NOT be split.  If split elements were marked only
by carrying the same osisID, there would be no way to distinguish between
the two cases. (In most cases, however, the value of the osisID attribute
will be unique, even if a single identifier value is included in more
than one osisID attribute.)
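
To illustrate the second reason, here is a minimal Python sketch (not part
of the OSIS standard or of any existing tool). It assumes un-namespaced
<verse> elements, and the repeated "Gen.48.7" elements are a made-up case
used only to show two elements that share an osisID without being split.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: Gen.48.5 is split across two paragraphs, while the
# two Gen.48.7 elements merely share an osisID and are NOT split.
SAMPLE = """<div>
  <p><verse osisID="Gen.48.5" splitID="Gen.48.5">first part</verse></p>
  <p><verse osisID="Gen.48.5" splitID="Gen.48.5">second part</verse>
     <verse osisID="Gen.48.6">whole verse</verse></p>
  <p><verse osisID="Gen.48.7">one element</verse>
     <verse osisID="Gen.48.7">another element, same osisID</verse></p>
</div>"""


def classify(root):
    """Return (split_groups, repeated_unsplit) for all <verse> elements."""
    split_groups = defaultdict(list)   # splitID -> fragments of one element
    unsplit = defaultdict(list)        # osisID  -> elements without splitID
    for verse in root.iter("verse"):
        split_id = verse.get("splitID")
        if split_id is not None:
            split_groups[split_id].append(verse)
        else:
            unsplit[verse.get("osisID")].append(verse)
    # Same osisID on several unsplit elements: distinct elements, not fragments.
    repeated = {key: items for key, items in unsplit.items() if len(items) > 1}
    return dict(split_groups), repeated


split_groups, repeated = classify(ET.fromstring(SAMPLE))
print(sorted(split_groups), sorted(repeated))
```

Without the splitID, the two Gen.48.5 fragments and the two Gen.48.7
elements would look exactly alike to a processor.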

Part of the argument made seems to be that a large number of extra
characters will be in the encoding.  XML is not known for its compactness,
and I don't believe that there will be many people directly encoding
Bibles (few, if any, will actually be typing pointy brackets; rather,
people will be using a translation tool or an OSIS XML editor).  From
that perspective the size or presence of the splitID attribute does not
seem to be a compelling reason to do something special for split <verse>
elements.

> 
> I seem to remember that in an earlier stage, splitIDs were also used for
> splitting quotations. However, in the current CEV samples this seems to be
> dealt with through milestoneStart and milestoneEnd elements (quotations
> that are not split still seem to have the "q" container element). Although
> the documentation for milestoneEnd in the OSIS 1.1.1 schema mentions the
> use of identical osisID and splitID attributes, these attributes do not
> seem to be used for split quotations in the CEV samples (see, for example,
> Isaiah 29:16 and Mark 16:6-7).

You can encode a quote in either way (milestones or split elements), in
the same way you can encode a chapter either way.  After having encoded
texts in several ways I have come to the conclusion that it makes more
sense to use milestones for elements that would be split across several
levels of the predominant hierarchy.  For example, if a chapter is
encoded as a <div> element in a document that gives preference to
section <div> and <p> elements, one fragment of the chapter <div> will be
a child of a <p> element, while another fragment of the same chapter
<div> will contain <p> elements that are siblings of that <p> element.
The same is true of quotes.  The further problem for chapters is that a
<p> element may not have a <div> element as a child, which would force a
special case when encoding chapters with <div> rather than milestones.
(If it is unclear how this can play out, I can encode an example.)

If forced to choose between milestones for both chapter and verse and
elements for both chapter and verse, I would go with elements for both,
and split <p> elements where they overlap with chapter <div> elements.
But I think that having a <div type="chapter"> as the parent of a <div
type="section"> sometimes and as a child other times is undesirable, and
I find it even more undesirable that the fragments of a <div
type="chapter" splitID="xyz"> element could be both parents and children
of <div type="section"> elements.

> 
> Some more remarks:
> - The use of splitIDs does not seem to be explicitly required in the OSIS
> schema. In other words, it might more or less be a "best practice" issue.
> - It would only be fair to acknowledge that requiring the use of splitIDs
> does not necessarily have to burden a text structure that is still being
> worked on: in principle, the splitIDs could be generated as a final step.
> 

The schema cannot force the use of a splitID because not all elements
will be split.  However, the use of splitIDs IS REQUIRED by the standard
when an element is split.  This is not an issue of "best practice".  Two
elements that merely have the same value in the osisID attribute mean
something different from two elements that have the same value in both
the osisID attribute and the splitID attribute.  

The splitID need not be the value of the osisID.  I simply find it a
convenient mechanism to maintain unique values.  The splitID could be
"elephant", "tiger", "tree", or any value you choose.

We considered a "split" Boolean flag that relied upon matching osisIDs,
but decided against it in favor of consistency with elements that may
not naturally be encoded with an osisID.

> 
> The "burden" of additional splitIDs would, in a sense, seem to be even
> heavier for bridged verses. This has to do with the fact that in the
> samples, the osisID attribute for a bridged verse has the form of a
> space-separated list of individual-verse osisIDs. Again, a split verse
> also has a splitID attribute with the same value.
> 
> A relevant example would be Genesis 48:8-10. This bridged verse is split
> across three paragraphs. The start tag for the content of these paragraphs
> is as follows:
> 
> <verse n="8-10" splitID="Gen.48.8 Gen.48.9 Gen.48.10" osisID="Gen.48.8
> Gen.48.9 Gen.48.10">
> 
> Now imagine a case in a Bible version where not three, but many more
> verses are bridged. A case in point would be Genesis 10:6-20 in CEV. That
> would result in the following start tag:
> 
> <verse n="6-20" splitID="Gen.10.6 Gen.10.7 Gen.10.8 Gen.10.9 Gen.10.10
> Gen.10.11 Gen.10.12 Gen.10.13 Gen.10.14 Gen.10.15 Gen.10.16 Gen.10.17
> Gen.10.18 Gen.10.19 Gen.10.20" osisID="Gen.10.6 Gen.10.7 Gen.10.8 Gen.10.9
> Gen.10.10 Gen.10.11 Gen.10.12 Gen.10.13 Gen.10.14 Gen.10.15 Gen.10.16
> Gen.10.17 Gen.10.18 Gen.10.19 Gen.10.20">

Again, the splitID need not have the same value as the osisID.  The above
could be encoded in either of the two ways indicated below.

<verse n="6-20" splitID="Gen.10.6.split-1" osisID="Gen.10.6 Gen.10.7
Gen.10.8 Gen.10.9 Gen.10.10 Gen.10.11 Gen.10.12 Gen.10.13 Gen.10.14
Gen.10.15 Gen.10.16 Gen.10.17 Gen.10.18 Gen.10.19 Gen.10.20">

or 

<verse n="6-20" splitID="abc" osisID="Gen.10.6 Gen.10.7 Gen.10.8
Gen.10.9 Gen.10.10 Gen.10.11 Gen.10.12 Gen.10.13 Gen.10.14 Gen.10.15
Gen.10.16 Gen.10.17 Gen.10.18 Gen.10.19 Gen.10.20">

I suspect that the first strategy would be used if encoded by hand, but
I don't have a problem with the example you state either.

> 
> As the bridged verse involved is split across seven paragraphs, this tag
> would occur seven times.
> 
> There may be even more extreme cases in CEV or other versions (does anyone
> know?). How long can such space-separated lists become? I am not aware of
> any length limitations for XML attributes as such, but I could imagine
> that an actual XML-processing application may use some sort of limitation
> in that regard (or wouldn't that be an issue in this case?).

I don't believe there is a length limitation.

> 
> Of course, this system of space-separated lists does have the advantage
> that references to an individual verse are relatively easy to find.
> 

This was a carefully debated strategy for osisIDs and I strongly support
the list of identifiers within osisIDs.  However, splitID is simply a
string and IS NOT a list from the perspective of the standard or XML.

> 
> (2) The lack of a direct representation of verse boundaries and verse
> numbers
> --------------------------------------------------------------------------
> 
> The samples have milestoneStart and milestoneEnd elements to indicate
> chapters. That would seem to make sense. At the verse level, however,
> there is not a similar explicit indication of where a verse starts or
> ends. Verse boundaries are more or less a matter of calculation.
> (Actually, as long as splitIDs are used for split verses, the matter is
> relatively straightforward for verse elements without a splitID.)

The start and end boundaries are precisely and unambiguously encoded.
If no splitID is present, the verse starts at the beginning of the
<verse> element and ends at the end of the <verse> element.  If a <verse>
element has a splitID, then the verse starts at the beginning of the
first <verse> element with the matching splitID and ends at the end of
the last <verse> element with the matching splitID.

This is VERY simple to handle with either XSLT transformations
(stylesheets) or software parsers.
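
As a rough sketch of what such a software parser might do (an
illustration only, in Python rather than XSLT, assuming un-namespaced
<verse> elements and a made-up sample fragment), the fragments of a split
verse can be gathered by their shared splitID:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modelled on the Gen.48.5 example quoted above.
SAMPLE = """<div>
  <p><verse osisID="Gen.48.5" splitID="Gen.48.5">part one</verse></p>
  <p type="indent"><verse osisID="Gen.48.5" splitID="Gen.48.5">part two</verse>
    <verse osisID="Gen.48.6">whole verse</verse></p>
</div>"""


def logical_verses(root):
    """Return [(osisID, text)] with split fragments joined, in document order."""
    order = []       # keys in order of first appearance
    fragments = {}   # key -> (osisID, [fragment text, ...])
    for index, verse in enumerate(root.iter("verse")):
        split_id = verse.get("splitID")
        # Unsplit elements get a unique key, so they are never merged.
        key = ("split", split_id) if split_id is not None else ("whole", index)
        if key not in fragments:
            fragments[key] = (verse.get("osisID"), [])
            order.append(key)
        fragments[key][1].append("".join(verse.itertext()).strip())
    return [(fragments[k][0], " ".join(fragments[k][1])) for k in order]


for osis_id, text in logical_verses(ET.fromstring(SAMPLE)):
    print(osis_id, "=>", text)
```

The verse starts at the first fragment carrying a given splitID and ends
at the last one; elements without a splitID stand on their own.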

The benefit of always encoding scripture text within a <verse> element
is that it is much easier to handle with XSLT transformations and with
software.  For example, you can ask for all <verse> elements with a given
osisID and you will get the needed <verse> elements.  If the verse
boundaries are encoded as milestones, then you have to walk through all
of the elements between the start and end milestones and determine, based
on context, whether each piece of text is part of the verse or some other
text.
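
A minimal sketch of such a query, assuming un-namespaced <verse> elements
and a made-up sample (Python is used here only for illustration; the same
lookup could be written in XSLT). Because an osisID is a space-separated
list, the lookup matches whole identifiers rather than substrings:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the bridged verse Gen.48.8-10 split across two
# paragraphs, followed by an ordinary verse.
SAMPLE = """<div>
  <p><verse n="8-10" splitID="abc" osisID="Gen.48.8 Gen.48.9 Gen.48.10">one</verse></p>
  <p><verse n="8-10" splitID="abc" osisID="Gen.48.8 Gen.48.9 Gen.48.10">two</verse>
     <verse osisID="Gen.48.11">three</verse></p>
</div>"""


def verses_for(root, identifier):
    """Return every <verse> element whose osisID list contains identifier."""
    return [
        verse
        for verse in root.iter("verse")
        # split() yields whole identifiers, so "Gen.48.1" does not
        # accidentally match "Gen.48.10" or "Gen.48.11".
        if identifier in (verse.get("osisID") or "").split()
    ]


root = ET.fromstring(SAMPLE)
print([v.text for v in verses_for(root, "Gen.48.9")])
```

No walking between milestones is needed: every element returned is verse
text by construction.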

If you want to walk along the elements between the start of a verse and
the end of a verse using a "range" function, milestones and verse
elements offer the same value.

If you simply want to know when to mark the superscripted verse number
when rendering, you check whether the <verse> element is split and, if
it is, whether this is the first fragment.
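
That check might look like the following Python sketch. It is an
illustration under stated assumptions: un-namespaced <verse> elements, a
made-up sample, and a fallback that derives the number from the last
component of a simple osisID when no "n" attribute is present (that
fallback is my own simplification, not something the standard defines).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: Gen.48.5 is split, Gen.48.6 is not.
SAMPLE = """<div>
  <p><verse osisID="Gen.48.5" splitID="Gen.48.5" n="5">part one</verse></p>
  <p><verse osisID="Gen.48.5" splitID="Gen.48.5" n="5">part two</verse>
     <verse osisID="Gen.48.6">whole verse</verse></p>
</div>"""


def render(root):
    """Return one string per <verse>, numbering only the first fragment."""
    seen = set()   # splitIDs whose first fragment has already been rendered
    lines = []
    for verse in root.iter("verse"):
        split_id = verse.get("splitID")
        is_first = split_id is None or split_id not in seen
        if split_id is not None:
            seen.add(split_id)
        text = "".join(verse.itertext()).strip()
        if is_first:
            # Prefer "n"; otherwise fall back to the last component of the
            # first identifier (assumption: a simple Book.Chapter.Verse form).
            label = verse.get("n") or verse.get("osisID").split()[0].rsplit(".", 1)[-1]
            lines.append(f"[{label}] {text}")
        else:
            lines.append(text)
    return lines


print("\n".join(render(ET.fromstring(SAMPLE))))
```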

Finally, there seem to be two opposing perspectives: making the
Book-Chapter-Verse hierarchy the predominant hierarchy vs. making the
Book-Section-Paragraph hierarchy the predominant one.  The first strategy
is more convenient for document users who want to extract/use the text
from a query perspective (usually within Bible software) and who are less
interested (or uninterested) in the section and paragraph information;
the second is better suited for publishing and translation.  I believe
that the needs of Book-Chapter-Verse oriented users are easily
accommodated by encoding using <verse> elements rather than verse
milestones.  I also believe that the Book-Section-Paragraph oriented
users are better served by using <verse> elements rather than verse
milestones. 

You may ask, what about the chapter milestones?  The reality is that
because osisID identifiers are hierarchical, the Book-Chapter-Verse
oriented users (as well as publishing users) need not ever use the
chapter milestones.  I am not, however, advocating eliminating them.

One final reason to use <verse> elements rather than verse milestones is
that a document may be split at section or paragraph boundaries.  In
those cases, the start or end milestone can be left out of the document,
which would make "range" functions unusable or unreliable (if code was
not written to check whether only one of a milestone pair is present).
However, in the same cases, if a verse is split into more than two
<verse> elements it would not be possible to determine whether you had
the first or the last fragment.  This would be possible if (as you
suggest below) a splitType attribute were added to the schema, with
possible values such as "first", "intermediate", "last", or "1 of 3",
"2 of 3", "3 of 3", where the " of " is required if the attribute is
present.  (I am not sure I am in favor of the change, though.)  It would
also be possible to recognize the first fragment if, as also suggested
below, the "n" attribute were present on only the first fragment of a
split element.

> 
> A related issue is that verse numbers (that, in Bible publications, are
> often printed at the beginning of a verse or in the margin) are not
> directly represented. When needed, they must more or less be calculated
> (in terms of position and value), e.g. through a stylesheet.
> 

First, this CAN be considered a presentation issue, and it would be
perfectly acceptable for the rendering process to "compute" the verse
number value that is normally superscripted.  However, I have preserved
the necessary value in the "n" attribute.  I consider this a "best
practices" issue.

> Given the fact that in (U)SFM, the start of a verse and the verse number
> are explicitly marked, one might hesitate about losing that explicit
> marking when converting to OSIS, as the information involved may later
> have to be reconstructed when needed (possibly more than once).
> 
> Several options might be considered:
> - Marking the start of a verse by a "milestone" element
> - Marking the start of a verse by a "milestoneStart" element, and the end
> of a verse by a "milestoneEnd" element (like what has been done in the
> samples for chapters)
> - Adding a special attribute to the first "verse" element of a split verse
> - Adding a special attribute to the first "verse" element of each verse
> (including ones that are not split verses)
> - Adding a special attribute to any "verse" element of a split verse
> except the first one (this attribute could have distinct values for the
> second "verse" element, the third one, etc.)
> 
> The last option would seem to give the least "overhead" in terms of
> additional elements or attributes. On the other hand, it does rely on
> calculation of the verse-number value on the basis of the osisID, which
> may be problematic or undesirable in some cases (is it?).
> 
> 
> Regards,
> 
> Dave
> 
> 
> Dear all,
> 
> Here are some additional notes to issue (2).
> 
> I have noticed that for bridged verses in the samples, the verse element
> has an "n" attribute that contains the verse-number range. This would mean
> that when it comes to adding verse numbers for bridged verses, the
> verse-number value would not have to be "calculated" on the basis of the
> osisID, but can be taken from the "n" attribute (which would make things a
> little easier).
> 
> In fact, one might consider using this "n" attribute for all "verse"
> elements, so that the verse-number value never has to be derived from the
> osisID. (In the samples, the "n" attribute seems to be used for all
> chapter milestoneStart / milestoneEnd elements.) That would seem to
> eliminate the possible disadvantage that I mentioned for the last option
> presented under (2). (On the other hand, it does add quite a few bytes,
> and quite a bit of redundancy. That could be a topic of its own: how
> desirable is redundancy in markup? For example, how desirable is it to
> have a separate attribute for a simple value that could also be derived
> from another attribute with a more complex value? It may well make
> processing easier, and it may enhance the opportunities for error
> recovery; but it also opens the door, at least in principle, to errors in
> the form of discrepancies that should not be there, which may require
> extra checking. Altogether, it might be wise to have minimal redundancy
> while a marked-up text is being developed, and to introduce some
> redundancy at the end through calculation, if that is desirable in order
> to prevent such calculation from having to be redone on several occasions.
> Would that be a viable course of action? Actually, the idea of generating
> splitIDs as a final step would be a case in point. By the way, does XML
> Schema allow a requirement that the value of a particular attribute be
> equal to that of another attribute? I would have to check it - does
> anybody know?)

The use of the "n" attribute is NOT required by the standard and may not
be endorsed by other members of the BibleTechnologies core working
group.  It is my strategy and one that I think is very useful,
especially when you consider that an osisID may also have identifiers
from more than one reference system and that not all verse identifiers
are integers.  (For example, osisID="Matt.1.3 Matt.1.4 Matt.1.5 Matt.1.6a
Matt.1.6 Matt.1.2" and osisID="Ps.44.1 heb:Ps.44.2 fr:Ps.45.1" are valid
osisIDs but do not make for easy determination of the value to render.)

There is certainly an exposure to inconsistency between the "n" and the
osisID attributes.  I think that the use of "n" should be a best
practice and is worth the exposure.  It is also not that hard to write
an XSLT transformation to pull out all instances where there is more
than one identifier in an osisID and/or an "n" attribute.  Using the
result, the consistency could be confirmed.  Better yet would be a
general purpose XSLT transformation that does the computation, but if
you had that then you could just put it in your XSLT transformation in
the first place. 
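
A consistency check of this kind could be sketched as follows (in Python
here rather than XSLT, purely as an illustration). The rule it applies is
an assumption of mine, not part of the standard: "n" is expected to be
"first-last" for a run of consecutive integer verse numbers from a single
reference system, and anything else is only reported for manual review.
The sample data is made up.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div>
  <verse n="8-10" osisID="Gen.48.8 Gen.48.9 Gen.48.10">ok</verse>
  <verse n="6-7" osisID="Gen.10.6 Gen.10.7 Gen.10.8">mismatch</verse>
  <verse n="1" osisID="Ps.44.1 heb:Ps.44.2 fr:Ps.45.1">mixed systems</verse>
  <verse osisID="Gen.48.11">no n attribute</verse>
</div>"""


def expected_n(osis_id):
    """Return the expected "n" value, or None if it cannot be derived simply."""
    numbers = []
    for identifier in osis_id.split():
        if ":" in identifier:        # identifier from another reference system
            return None
        last = identifier.rsplit(".", 1)[-1]
        if not last.isdigit():       # e.g. "6a"
            return None
        numbers.append(int(last))
    # Only a run of consecutive verse numbers is handled by this sketch.
    if not numbers or numbers != list(range(numbers[0], numbers[0] + len(numbers))):
        return None
    return str(numbers[0]) if len(numbers) == 1 else f"{numbers[0]}-{numbers[-1]}"


def check(root):
    """Return (osisID, n, note) for every <verse> whose "n" looks suspect."""
    problems = []
    for verse in root.iter("verse"):
        n, osis_id = verse.get("n"), verse.get("osisID", "")
        if n is None:
            continue                 # nothing to compare
        expected = expected_n(osis_id)
        if expected is None:
            problems.append((osis_id, n, "needs manual review"))
        elif expected != n:
            problems.append((osis_id, n, f"expected {expected}"))
    return problems


for problem in check(ET.fromstring(SAMPLE)):
    print(problem)
```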

Another reason to use the "n" attribute is for cases where the rendered
language does not use "western" numerals.  I rendered an Arabic Bible
and found that I didn't have the characters for the numerals.  This,
however, may be a poor reason to use the "n" attribute, because an
external mapping between the osisID and the language-specific numeral
representation can be provided by the rendering process, but that would
require more from the rendering process and require more than one
document/file to render the OSIS encoded Bible.  Again, we may be better
off writing the complicated "function" to go in an XSLT transformation
and not using the "n" attribute.

> 
> Alternatively, the "n" attribute might be used only for "verse" elements
> that are the first part of the verse involved, so that it could serve to
> indicate the start of a verse, more or less like the fourth option
> mentioned under (2). (Having an "n" attribute for each split-verse "verse"
> element does not seem to be very useful anyway - except maybe if, for
> example, one wants to extract a section that starts in the middle of a
> verse, and one would still want to display the verse number of that
> verse.)

This could be used at the discretion of the encoder and should be
discussed as a potential "best practice" but would not change the need
to properly encode "splitID" attributes.

> 
> 
> Regards,
> 
> Dave
> 
> 

Todd