[osis-core] Free-range grain-fed sacred cows

Steve DeRose osis-core@bibletechnologieswg.org
Tue, 30 Jul 2002 13:13:57 -0400


At 11:23 AM -0700 07/22/02, Troy A. Griffitts wrote:
>>>self-identify some of the more contemporary texts that versify by
>>>paragraphs. e.g. "This paragraph is Mark 1:1-9"
>>>
>>
>>
>>In this case the "1-9" becomes a verse name in the current reference
>>system in same way that "4" is a verse name in "Gen.1.4".  Other options
>>were discussed in recent posts.
>
>This seems useless, unless self referencing in the same document. 
>If I have a Bible such as this installed in some Bible software 
>application, and I have a commentary that has a <reference> to 
>Mark.1.7, I would hope my Bible would jump to the "This paragraph is 
>Mark 1:1-9" paragraph.  I don't think we decided how to do this.
>
>In Dallas, we decided to force these types of Bibles to have 
>multiple milestone starts, so we could still, easily do a 
>string-match reference resolution system.
>
>e.g.
><verseStart ref="Mark.1.1" />
><verseStart ref="Mark.1.2" />
>...
>
>
>Now that we're using containers, I'm not sure how we've decided to 
>allow this.  I still think it's not a trivial jump we're making if 
>we decide to allow ranges.  I'm not necessarily against it, but am 
>concerned about the complexities introduced.  The multiple milestone 
>start solution was brainless and made for easy implementation.  I 
>could write XPath to resolve to any versification reference that 
>this Bible claims to implement.  In the range solution, this is no 
>longer true.

I'm also still pretty nervous about using ranges there.

It does force extra work on software, and the algorithm doesn't seem 
entirely obvious to me. Why not just put the individual markers all 
in? It does put a burden on those editions that only mark by 
paragraphs (or whatever); but that burden can be automated by a 
utility that expands the markup for them before they release the text 
as being in OSIS; those with smart software won't even have to notice 
(they type in whatever they want and it expands underneath). Such a 
utility is much simpler than the range-intersection algorithm, *and* 
it only has to be implemented once, rather than implemented within 
every separate piece of OSIS-supporting software (editing, 
typesetting, retrieval, browsing...).
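
A minimal sketch of such an expansion utility, in Python. The compound 
syntax it accepts (Book.Chapter.First-Last) and the function names are 
assumptions for illustration only, not anything the group has agreed 
on; the verseStart output just mimics the milestone example quoted 
above.

```python
import re

# Hypothetical compound form: Book.Chapter.First-Last, e.g. "Mark.1.1-9".
COMPOUND = re.compile(
    r"^(?P<prefix>[A-Za-z0-9]+\.\d+)\.(?P<first>\d+)-(?P<last>\d+)$"
)

def expand_compound(ref):
    """Return one plain verse ID per verse covered by a compound ID."""
    m = COMPOUND.match(ref)
    if m is None:
        return [ref]  # not a compound ID; pass it through unchanged
    first, last = int(m.group("first")), int(m.group("last"))
    if first > last:
        raise ValueError(f"verses out of order in {ref!r}")
    return [f"{m.group('prefix')}.{n}" for n in range(first, last + 1)]

def verse_start_markers(ref):
    """Emit one milestone per verse, in the style of the quoted example."""
    return [f'<verseStart ref="{v}" />' for v in expand_compound(ref)]
```

Run once over a paragraph-versified text before release, something 
like this would leave downstream software with nothing but plain 
string matching to do.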

Also, it seems to me structurally incorrect -- something like 
Mark.1.1-3 is not an identifier as I understand it -- it is a 
structured expression that *uses* other identifiers. I think if you 
asked most laymen what that string means, they would be hard pressed 
to say anything about it without referring to those other 
identifiers. Thus, I claim this is an expression, conceptually.

Note that this looks like a range, but isn't. The syntax and 
semantics are not the same as the range Mark.1.1-Mark.1.3. They're 
related, but a range reference involves selecting on 3 keys and 
concatenating the results (or something faintly like that); the 
compound-verse identifier is a special kind of key semantics, where 
retrieval has to be smart enough to know that a variety of query keys 
will match this (meta-) key value in the data. Quite different 
implementation issues.
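
To make the difference concrete, here is a rough Python sketch of the 
two operations. Both functions and the key syntax are hypothetical; 
the point is only that a range reference is a series of ordinary 
lookups, while a compound-verse identifier needs a matching rule that 
understands the stored key.

```python
def range_query(index, keys):
    """Range reference: look up each key and concatenate the results."""
    return [index[k] for k in keys if k in index]

def compound_match(stored_key, query_key):
    """Compound-verse ID: does a single-verse query hit the stored key?"""
    if stored_key == query_key:
        return True
    prefix, _, tail = stored_key.rpartition(".")
    q_prefix, _, q_tail = query_key.rpartition(".")
    if prefix != q_prefix or "-" not in tail or not q_tail.isdigit():
        return False
    first, last = tail.split("-", 1)
    if not (first.isdigit() and last.isdigit()):
        return False  # assumed: only plain numeric bounds are understood
    return int(first) <= int(q_tail) <= int(last)
```

With this, compound_match("Mark.1.1-9", "Mark.1.7") succeeds, which is 
the behavior Troy wants from his commentary reference, but every piece 
of software that stores such keys has to carry the matching rule.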

Also, it isn't really *just* another identifier token -- the numbers 
have constraints like a range would (like having to be in order). 
Also, weird numbering systems would make the implicit loop not work 
-- for example, what if one version has marked Matt.1.2a (now *that* 
seems to me like a real identifier -- just a token to match), and you 
click on it to find parallels? The loop that expands Matt.1.1-3 is 
not going to generate the '2a' reference; and heaven forbid that 
anyone should number their verses backwards (seems unlikely, but in 
this business I wouldn't bet much money against it happening 
somewhere).
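
A tiny Python illustration of both failure modes, assuming the 
implicit loop is nothing more than a numeric count from the first 
verse number to the last (naive_expand is a made-up name):

```python
def naive_expand(book, chapter, first, last):
    """Assumed implicit loop: count from first to last, inclusive."""
    return [f"{book}.{chapter}.{n}" for n in range(first, last + 1)]

generated = naive_expand("Matt", 1, 1, 3)
print("Matt.1.2a" in generated)       # False: '2a' is never generated
print(naive_expand("Matt", 1, 3, 1))  # []: backwards numbering yields nothing
```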


I recently realized that this gets messier when we cross it with the 
idea of using grains to mark the parts of discontiguous verses.

My first problem with grains for this is that it seems conceptually 
incorrect -- we defined grains as being for mechanically identifying 
locations within the smallest units -- that is, as the escape for 
dealing with finer-grained addressing than the system allows. But:

1) users will seldom request the part of a verse that we had to break 
off into a separate part because it was right after an embedded quote 
(etc. etc) -- and if they do want something like that, they can't 
readily predict what the grain identifier for it would be.

2) Using grains to identify these parts conflates 2 separate notions 
(as I think Harry pointed out earlier): tying together the parts, 
vs. identifying the whole.

3) We raise new error conditions: What if the grain identifiers on 
the parts do not in fact evaluate to those parts? For example, it 
says @char(44) but in fact the first character within is character 
45, or 50, or 200? Is it a validity condition that these be right?

4) The fact that there *can* be a contradiction, suggests that the 
data is non-normalized in a slightly dangerous way. In practice, this 
leads to situations where it is extra work (human or automated) to 
keep such things in sync. Thus:

    a) the identifiers will creep off as editing occurs, and authors will not
       be happy about having to fix them

    b) what happens when a new edition comes out with slight changes -- all
       these identifiers become invalid? If these were really identifiers I
       think they shouldn't die so easily.

    c) how does this support re-ordering? If parts of the verse occur out of
       order, their grain values will too; in which case the semantics of grains
       are ambiguous:

       i) In a grain used in a self-id, the grain is definitive: regardless of
          where this piece occurs, the self-id's grain constitutes an assertion
          that you are at this grain position. This massively complicates
          the implementation of grain-finding (it ain't just counting anymore)

       ii) In a grain used in a reference, the grain is a query: you must search
           for it.

This then raises the nasty case that for any re-ordering, there will 
be grain-references that could lead to two places. For example, 
consider:

     <z id='John.1.1@char(01)'>In the beginning </z>
     <z id='John.1.1@char(22)'>the Word</z>
     <z id='John.1.1@char(18)'>was </z>

Not a great example, but I think it will do.

Now, where does a reference to John.1.1@char(19) lead? To 'a' or to 
'h'? Better yet, does a reference to John.1.1@char(22) lead to 't' 
or to 'W'? Is @char(3) out of range, or does it point to 'w'?
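
The two readings can be spelled out in a few lines of Python. This is 
only a sketch: the (declared grain, text) pairs transcribe the three 
<z> elements above in document order, and both resolver functions are 
hypothetical.

```python
# (declared grain, text) for each <z> element, in document order.
parts = [
    (1, "In the beginning "),
    (22, "the Word"),
    (18, "was "),
]

def resolve_by_self_id(parts, n):
    """Treat each part's declared grain as definitive."""
    for start, text in parts:
        if start <= n < start + len(text):
            return text[n - start]
    return None

def resolve_by_counting(parts, n):
    """Ignore the declared grains; count characters in document order."""
    text = "".join(t for _, t in parts)
    return text[n - 1] if 1 <= n <= len(text) else None

print(resolve_by_self_id(parts, 19), resolve_by_counting(parts, 19))  # a h
print(resolve_by_self_id(parts, 22), resolve_by_counting(parts, 22))  # t W
```

Neither reading is obviously the wrong one, which is the problem.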

5) Authors will also have trouble generating these beasties in the 
first place; thus we impose a barrier of software support between us 
and acceptance/use.


I think we'd be alright simply putting the whole verse's ID on each 
part, and let them be distinguished via the next/prev stuff. Yes, it 
does mean that you get 3 'hits' (or whatever) for a verse retrieval. 
Oh, and if this is stored in an RDB, when you get the 3 partial-verse 
records back, you can sort them correctly in the face of reordering 
only if you have next/prev, not if you just have grains. 
Concatenating by chaining through the next/prevs seems to me easier 
to implement.
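
Roughly what I have in mind, as a Python sketch. The row layout (a 
"key" per part, with "prev" and "next" holding the keys of the 
neighboring parts) is an assumption made up for the example, not a 
proposed schema.

```python
def assemble_verse(rows):
    """Order partial-verse rows by following next links from the head."""
    by_key = {r["key"]: r for r in rows}
    heads = [r for r in rows if r["prev"] is None]
    if len(heads) != 1:
        raise ValueError("expected exactly one part with no prev link")
    ordered, seen, cur = [], set(), heads[0]
    while cur is not None:
        if cur["key"] in seen:
            raise ValueError("cycle in next/prev chain")
        seen.add(cur["key"])
        ordered.append(cur["text"])
        cur = by_key.get(cur["next"]) if cur["next"] is not None else None
    return "".join(ordered)

# Three hits for John.1.1, returned by the database in arbitrary order.
rows = [
    {"key": 3, "verse_id": "John.1.1", "prev": 1, "next": 2, "text": "was "},
    {"key": 1, "verse_id": "John.1.1", "prev": None, "next": 3, "text": "In the beginning "},
    {"key": 2, "verse_id": "John.1.1", "prev": 3, "next": None, "text": "the Word"},
]
print(assemble_verse(rows))  # In the beginning was the Word
```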

Note also that users are already used to one kind of partial-verse 
identifier that doesn't have most of these problems (though they have 
not been very formal to date): appending a letter to designate 
successive parts of a verse.

So my proposal for this would be to disallow grains on self-ids, and 
to either suggest or (preferably?) require appending a, b,... to the 
identifiers if you want to make the parts accessible (those should be 
declared in a reference scheme, but I'm not picky about that part, 
seems like a nit at this point). This gives us nice predictable 
names for the parts of a discontiguous verse (say, for use in the 
next/prev values), and makes a trivial algorithm to strip them off. 
Indeed, we could use the default fallback algorithm, which is to 
strip off trailing tokens of a reference; to do that we just put '.' 
before the 'a'.
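
For what it's worth, the fallback is about this simple. A Python 
sketch, where known_ids stands in for whatever set of identifiers a 
given text actually declares (the function name is hypothetical):

```python
def resolve(ref, known_ids):
    """Try the full reference, then strip trailing '.'-separated tokens."""
    tokens = ref.split(".")
    while tokens:
        candidate = ".".join(tokens)
        if candidate in known_ids:
            return candidate
        tokens.pop()  # drop the last token and retry with a coarser ID
    return None

print(resolve("John.1.1.a", {"John.1.1"}))                  # John.1.1
print(resolve("John.1.1.b", {"John.1.1.a", "John.1.1.b"}))  # John.1.1.b
```

A text that marks the lettered parts gets an exact hit; one that 
doesn't still lands on the whole verse.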

S
-- 

Steve DeRose -- http://www.stg.brown.edu/~sjd
Chair, Bible Technologies Group -- http://www.bibletechnologies.net
Email: sderose@speakeasy.net
Backup email: sjd@stg.brown.edu