[osis-core] osisGenRegex: General Statement

Steven J. DeRose osis-core@bibletechnologieswg.org
Tue, 21 Oct 2003 11:24:32 -0400


(cc to philosophers is due to my <soapbox> bit at the bottom)

At 12:57 -0700 2003-10-20, Chris Little wrote:
>Patrick,
>
>My hope/desire for attributes like lemma would be that they be easy to
>parse and easy for humans to understand.  Towards that end, it is my hope
>that we come up with something that makes "Strong:G1234" the only valid
>way to refer to Strong's Greek lemma number 1234.
>
>If we want to say that encoders should have a work named "Strong" listed
>in the <works>s, that's fine with me.  If we want to add a prefix like
>"osisRef" that indicates what follows is a valid osisRef, that works for
>me too.  (e.g. someone wanted to use Mounce's numbering for some kind of
>limited lemmatization, so he might have a work with ID "Mounce" and lemma
>attributes like "osisRef:Mounce:123".)
>
>To answer your questions, I would say a prefix should always be required. 
>It reduces ambiguity, and I'm not concerned with filesize issues.
>
>Regarding whether they should point to a work in the header, I don't think
>they should, necessarily.  Lemmata are fairly limited in number.  Strong
>enumerated about 14,000, I think.  Morphological tags are frequently just
>patterns of slots with variables that can fit into each slot.  They're
>defined algorithmically rather than by enumeration.  For example, a
>morphological tag for English pronouns might consist of "NP-"  followed by
>slots for person, number, gender, and case (72 possible tags, just for
>pronouns).  I think it is unlikely, in some cases, that a document would
>ever be made to hold all the possible values of some tag systems.  For the
>Sword Project, we have two morph tag indexes, both of which are based on
>algorithmic systems.  However, the indexes themselves are not exhaustive,
>but are based on all of the tags that actually occur in those specific
>texts that happen to be coded to them.
>
>We could still create a work element in the header that does not refer to
>an actual document that will ever exist.  But I fail to see the worth in
>doing so.  Better to just tell people that Strong: and GK: mean one thing,
>osisRef: means you have a real document and are referencing osisIDs in it,
>and x- (if we keep it) means you don't care about standards.
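The slot-based morphology scheme quoted above can be sketched concretely. The slot values below are hypothetical stand-ins, but they show how an algorithmic tag system generates its whole space rather than enumerating it (3 persons x 2 numbers x 3 genders x 4 cases = 72 pronoun tags):

```python
from itertools import product

# Hypothetical slot values for an "NP-" pronoun tag scheme, as in the
# example quoted above; a real tagset defines its own inventories.
PERSONS = ["1", "2", "3"]          # first, second, third person
NUMBERS = ["S", "P"]               # singular, plural
GENDERS = ["M", "F", "N"]          # masculine, feminine, neuter
CASES   = ["N", "G", "D", "A"]     # nominative, genitive, dative, accusative

# The tag space is defined algorithmically, not by enumeration in a header.
tags = ["NP-" + "".join(slots)
        for slots in product(PERSONS, NUMBERS, GENDERS, CASES)]

print(len(tags))   # 72 possible pronoun tags
print(tags[0])     # "NP-1SMN"
```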

Well, that's exactly what W3C XML Namespaces do -- though I might 
still agree in failing to see the worth of it, at least in the form 
Namespaces does it. As I recall, I argued that namespace declarations 
should point to an actual declaration of the names; some people 
argued that was wrong-headed; others argued that we didn't have time 
to spec out the name-declaration file. That last argument was 
indisputable....

>
>But honestly, I won't complain if we just set them all back to x-[^\s] and
>deal with the problem in 2.1 or whatever comes next.
>
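Since the thread's subject is a general regex, the prefixed lemma forms under discussion could be validated with something like the following sketch. The exact prefix set and value shapes here are my assumptions, not settled OSIS syntax:

```python
import re

# Hypothetical pattern covering the forms discussed above:
#   Strong:G1234, GK:123, osisRef:Mounce:123, x-anything
LEMMA_RE = re.compile(r"""
    ^(?:
        Strong:[GH]\d+            # Strong's Greek/Hebrew numbers
      | GK:\d+                    # GK numbers
      | osisRef:[A-Za-z0-9.]+:\S+ # a named work and an osisID within it
      | x-\S+                     # non-standard escape hatch
    )$
""", re.VERBOSE)

for value in ["Strong:G1234", "osisRef:Mounce:123", "x-myScheme", "1234"]:
    print(value, bool(LEMMA_RE.fullmatch(value)))
```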

Generally I'm the last person to care about verbosity in markup; but 
my taste shifts when we get down to word-level stuff, just because 
there's *so much* of it. So being able to define, say, an "s" prefix 
for Strong's -- one that points to a work (whether abstract or 
tangible) -- saves 6 bytes per word. For 790,871 words (according to 
a cute little site at http://groups.msn.com/MyBibleFacts/biblestats.msnw 
that I just found), that's about 4.5MB -- nothing to sneeze at, even 
on a work as small as the Bible.
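The back-of-the-envelope savings work out as follows, taking the word count from the site cited above and the 6-bytes-per-word figure as given:

```python
WORDS = 790_871           # word count cited above
BYTES_SAVED_PER_WORD = 6  # per the estimate in the text

total = WORDS * BYTES_SAVED_PER_WORD
print(total)                          # 4745226 bytes
print(round(total / 2**20, 1), "MB")  # 4.5 MB
```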

<soapbox resp="sjd">
Also, I personally view the notion of formal naming, with formally 
defined names and sets of names, as an even bigger value of OSIS than 
the markup itself (technically, I'd call the markup just a special 
case of formal naming anyway). Dumb stuff like getting everybody to 
say "Matt" instead of 20 variations for the name of the gospel, to 
use "." instead of varying punctuation, and to have a way of portably 
saying "this is a Strong's number", or "this is the John who authored 
the gospel", is what starts to enable the next level of processing -- 
where we can at least start trying to do formal logic and reasoning 
over the text.

Once we have texts with all the lemmas defined and disambiguated for 
word sense, and all the individual people, places, and objects 
identified, it becomes feasible to start building a Prolog 
representation of the Bible (or properly, of our interpretations): 
where we interpret passages into formal logic assertions. I suspect 
the trickiest part is time-indexing them: e.g., when we assert that a 
king is the sole head of a country, the assertion has to be bound to 
the years of his reign, so the OT doesn't appear to contradict itself 
merely because there were many kings of Israel and Judah throughout 
history. The beauty of this is not that we will succeed -- 
I believe the Bible is probably ineffable enough that we won't. But 
(! just like XML), this helps get us to where we can pose assertions 
and questions formally, and that in itself will give us an incredibly 
valuable tool for interpretation and for seeing and attacking the 
much more substantial next layer of questions.
</soapbox>
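The time-indexing idea can be sketched as a minimal fact store, where every assertion holds only over an interval. The predicate, entities, reign dates, and API below are purely illustrative:

```python
# A minimal time-indexed fact store: "X is king" facts for successive
# kings never conflict because each holds only over its own interval.
# Names and approximate reign dates are illustrative only.
FACTS = [
    # (predicate, subject, object, (start_year, end_year)); negative = BC
    ("king_of", "Saul",    "Israel", (-1050, -1010)),
    ("king_of", "David",   "Israel", (-1010, -970)),
    ("king_of", "Solomon", "Israel", (-970,  -931)),
]

def holds(predicate, obj, year):
    """Return the subjects for which (predicate, obj) holds at `year`."""
    return [s for p, s, o, (start, end) in FACTS
            if p == predicate and o == obj and start <= year <= end]

print(holds("king_of", "Israel", -1000))  # ['David']
print(holds("king_of", "Israel", -960))   # ['Solomon']
```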

S
-- 

Steve DeRose -- http://www.derose.net
Chair, Bible Technologies Group -- http://www.bibletechnologies.net
Email: sderose@acm.org  or  steve@derose.net