[osis-core] Notes from OSIS meetings of 2004-01-31

Steven J. DeRose osis-core@bibletechnologieswg.org
Sat, 31 Jan 2004 18:40:34 -0500


On examining usfm, todd & chris found that the only things that 
didn't map easily to osis were qr and qc, for quadding poetry (only) 
to right and center.

Proposal: +n for indents from left (actually start, to account for 
r-to-l lgs), -n from right, and 0 for centered.

What would a translator accept as reasons for a translator to use 
these -- 'it looks good' would be iffy; inconsistency would be 
rejected; the reason should be linguistic.

The problem of automated conversion does not mean we cannot have a 
finer grained system -- these two codes are remnants of formatting 
orientation in sf.

Can we extract a set of *reasons* for people using qc and qr?

are we overloading notion of "line" for typographic vs. structure?

Maybe for conversion purposes, we just let them use type attribute?

Or 'kind'?

We could enumerate some reasons, and allow center/right if you don't know.

possibility: type='unknown' subtype='center|right'

translators don't want to worry about all these distinctions.

yes, but they also often don't want to worry about lots of other 
distinctions (being focused on getting the Bible published) -- but 
letting them just use formatting (Word file, format macros....) in 
fact costs them more time.

can checkers gradually develop a list of the qc and qr related types?

possible consensus:

line type attribute is for meaningful types, to be determined by agency

line type of 'unknown' with subtype for typography
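
A minimal sketch of how a converter might apply this possible consensus, in Python. The function and table names are illustrative assumptions, not part of usfm or osis; only the qc/qr markers and the type='unknown' plus subtype idea come from the discussion above.

```python
# Hypothetical table for a usfm-to-osis converter (names are illustrative).
# For qc and qr only the typography is known, so type stays 'unknown'
# and the typographic hint is carried in subtype.
TYPOGRAPHIC_SUBTYPES = {"qc": "center", "qr": "right"}


def line_attributes(usfm_marker):
    """Return the attributes to record on a line for a usfm poetry marker."""
    marker = usfm_marker.lstrip("\\")  # accept both 'qc' and '\qc'
    subtype = TYPOGRAPHIC_SUBTYPES.get(marker)
    if subtype is None:
        return {}  # no typographic hint to record for this marker
    return {"type": "unknown", "subtype": subtype}
```

A checker who later learns why a line was centered could then replace 'unknown' with a meaningful type without losing the typographic hint.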

can usfm do more? e.g. enumerate appropriate meaningful uses of 
qc/qr, and then (a) at least add those to the doc for those tags; and 
(b) if it makes sense, add tags for them.

Develop a usfm/osis manual -- guidance for mapping.

Can USFM define what qr means in right-to-left languages?

Can USFM define more meaningful alternatives to qc and qr?

Can they add an inscription tag?

Should we enumerate? start/center/end? Pro: validation.

consensus: add type=unknown; add enumeration (extensible) for 
justification types. ql/qc/qr, left/center/right, 
left/center/right/start/end; start/center/end

Narrowed by poll to l/c/r or l/c/r/s/e. Finish on list.
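
To illustrate what start/end would add over left/right, here is a small Python sketch that resolves a justification value to a physical side. The function name and the right_to_left flag are assumptions made for this example; the value names are the ones polled above.

```python
def physical_side(justification, right_to_left=False):
    """Resolve left/center/right/start/end to a physical side of the page."""
    if justification in ("left", "center", "right"):
        return justification  # already physical, independent of direction
    if justification == "start":
        return "right" if right_to_left else "left"
    if justification == "end":
        return "left" if right_to_left else "right"
    raise ValueError("unknown justification: %r" % justification)
```

With l/c/r only, the same markup means different things in r-to-l and l-to-r texts; with start/end the direction is resolved at rendering time.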


Note to editors: should we separate out all potentially-enumerable 
attributes into a schema type?

Welcome David Haraburta, Baylor CS student working with Kirk L.

--------------

Switching to Linguistic Annotation

(intro linguistics summary)

Levels of analysis: phonology, morphology, syntax, discourse

morphology/part-of-speech annotation

sometimes determination requires arbitrary amounts of contexts

no neat 1:1 mapping from categories to features (like part-of-speech)

For example:

look up a hill
look up a word

but
look a hill up (wrong, at least for the meaning like 'look up a hill')
look a word up (fine, and means same as corresponding example above)

--> "look up" is a verb with a space in it (and which can be shifted 
even further apart).

Also, conjoint forms like "don't", etc.

Hebrew 'melek', can't tell if it's construct or not without more context.

A single word instance may even have different parts of speech in 
different clauses that include it, although this is rarer.

Lemma vs. morphology: lemma says what root "word"; morphology says 
what grammatical form, etc.


Issue of recording obsolete systems accurately (e.g., Strong's lemma 
numbers even when they're now deemed wrong), and also being able to 
express modern consensus, and individual's annotations that may not 
conform to any "standard" taxonomy.

(much harder above morphological level)

issue of ambiguity.


Consider Eagles work on defining feature sets for EU languages.


Problems:

1) How to link in to OSIS texts

    e.g. add analysis to a text that has <w> -- expand capability of w

2) Inline versus out of line markup

[[sjd: Is there a schema construct for saying "any" attribute permissible?

3) Should we introduce a <morpheme> level tag?

(question of using namespaces)

First approach: add <morpheme>, everything goes on attributes.

Second approach: un-flatten it into element structures

model: Provide a large set of features, hoping to cover vast majority 
of lgs; but provide a way to subtract values inapplicable in any 
given language.

[[sjd: provide a way to pull in the definition file and then 
add/subtract features/values

[[problem: if we go with portmanteau references a la top level of TEI 
fs, we have to enumerate all the combinations, and users have to 
enumerate all their deletions -- couldn't practically delete "dual" 
number with a single statement. We could provide a way of expressing 
structure inside the values, like a token in the value for each 
feature expressed, and a way to name each level and associate its 
context in the attribute value/reference string, with the particular 
feature name

E.g. n-n-m-s

would express a pattern of
    category=noun
    case=nominative
    gender=masculine
    number=singular

then you could get rid of the dual value for the number feature with 
something like:

    delete n-*-*-d
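
A minimal Python sketch of both ideas: expanding a positional tag into named features, and deleting a value with a single wildcard pattern. The table layout and function names are assumptions for illustration, and only the codes mentioned in these notes are listed.

```python
from fnmatch import fnmatchcase

FEATURE_NAMES = ("category", "case", "gender", "number")

# One code table per position; only codes mentioned in the notes appear here.
CODES = (
    {"n": "noun"},
    {"n": "nominative"},
    {"m": "masculine"},
    {"s": "singular", "d": "dual"},
)


def decode(tag):
    """Expand a positional tag such as 'n-n-m-s' into named features."""
    parts = tag.split("-")
    if len(parts) != len(FEATURE_NAMES):
        raise ValueError("expected %d fields in %r" % (len(FEATURE_NAMES), tag))
    return {name: table[code]
            for name, table, code in zip(FEATURE_NAMES, CODES, parts)}


def delete(tags, pattern):
    """Drop every tag matching a shell-style pattern such as 'n-*-*-d'."""
    return [t for t in tags if not fnmatchcase(t, pattern)]
```

Note that a shell-style '*' can also match across hyphens, so a real tool would probably want to match field by field.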

Or, someone could make a simple interface for making changes (which 
might map a request to a whole lot of trivial cases). Or, we could 
give them the TEI fslib we create, and let them literally delete/mod 
as needed. Could get a tool built, too.

Would we be able to keep to one sequence, fairly flat like here, or 
do we need some kind of parenthesizing inside the attributes (if the 
latter, it quickly gets complicated enough that it belongs in element 
markup instead -- which, however, has the problem of forcing users to 
touch the schema to change the tag vocab).


Kirk: tried TEI fs's at the start. Hard to find actual examples, 
usage guidance. There are 1844 distinct feature-spec strings in BHS: one 
char for part-of-speech, etc.

To use fs's, seems you would have to have a GUI, because too many 
features to memorize.

Much easier if you make the idrefs to the feature structures be 
exactly the (for example) BHS mnemonics.

Could also split out lg universals (say, pos and context-Boundedness) 
to separate attributes

Can separate question of whether to use TEI fs's, and whether to 
provide multi-level (morpheme vs. word level) annot.

<seg granularity='word|morph|...'>....

(case of discontiguous words/morphemes, so need some pointing 
mechanism -- can you do this in TEI? Like, binding a feature to a 
word-instance value.)

fs are more palatable with the mnemonics -- still need a good UI for 
real users to have a chance.

issue of things like TEI global attributes..... just select the fs 
module, drop any global attrs we don't use (see current.

possibility: use namespace prefix to identify fslib in header, and 
people can add their own attributes to w and m elements to add their 
own features.


summary:

Define schema (per TEI) for fslibs

Declare such fslibs as works of class 'fsd':
    <work osisWorkID="class">
       <identifier type='osis'>fsd.he.WHI.2004...</>
      ...
    </work>

Then refer to them via the prefixing mechanism:
    <m feature='class:pro....'>...</m>
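
A rough Python sketch of resolving such a prefixed reference against the declared works. The wrapping header element, the identifier 'fsd.example' (standing in for the elided one above), and the feature name 'pro' are hypothetical, chosen only so the example parses.

```python
import xml.etree.ElementTree as ET

# Hypothetical header fragment: 'fsd.example' stands in for the real,
# elided identifier, and the <header> wrapper is for illustration only.
HEADER = """
<header>
  <work osisWorkID="class">
    <identifier type="osis">fsd.example</identifier>
  </work>
</header>
"""


def work_prefixes(header_xml):
    """Map each osisWorkID prefix to that work's osis identifier."""
    root = ET.fromstring(header_xml)
    prefixes = {}
    for work in root.iter("work"):
        ident = work.find("identifier[@type='osis']")
        if ident is not None:
            prefixes[work.get("osisWorkID")] = ident.text
    return prefixes


def resolve(feature, prefixes):
    """Split 'prefix:name' and return (fslib identifier, feature name)."""
    prefix, _, name = feature.partition(":")
    return prefixes[prefix], name
```

An undeclared prefix raises KeyError here, which is probably what a validator would want to report.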

Next prob: combining discontiguous parts:
    features can be referenced from w or m, or a generic <wordpart>.....
    on discontiguous things, link them up

what about milestones? insufficient for discontiguous.

TEI defined <join> for this.... sits somewhere (a type of link) and 
points to all the parts. goes into a joingroup in an anonymousBlock.

Problem of duplicate osisIDs -- several meanings with no way to tell 
(for example a verse):

1) Discontiguous portions of a verse

2) Multiple distinct copies of a verse from different works (diglots, 
parallels)

3) Multiple copies of the same verse from the same work (in a commentary, say)

4) Alternate readings of the same verse (end of Mark)

5) Combinations of the above.

We have a problem waiting out there when implementors have to decide 
how to process duplicate osisIDs.
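
As a starting point for implementors, a Python sketch that at least finds the duplicates. It assumes each osisID attribute holds a single identifier and does not try to tell the five cases above apart; the function name is illustrative.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_osis_ids(xml_text):
    """Return the osisID values that occur on more than one element."""
    counts = Counter(
        el.get("osisID")
        for el in ET.fromstring(xml_text).iter()
        if el.get("osisID") is not None
    )
    return sorted(osis_id for osis_id, n in counts.items() if n > 1)
```

Deciding which of the meanings applies still needs information that the osisID alone does not carry.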



Seems like we're re-inventing TEI bit by bit....


Problem: how to mark up discontiguous constituents? Gotta have a 
pointer across; but then, where do we hang properties of the whole?

Plan:

after lunch:

    troy item

    last few usfm issues

    gorier bits of features

Documentation:

     Comments on current state of doc, AND on pld's disposition of 
prior comments, are due by Feb 15.
     Hard date -- anything not raised by then is left to editors' 
discretion (if any).

     Comments on changes made after now, will be accepted later than Feb 15.



----- After lunch:

Troy's cases:

(cf notes from pld on first two issues)

3: How to mark up this in a lexicon:

This word occurs one hundred fifty seven times in the NT.

Need a way to get machine-processable numeric value.

<seg type="x-occ"> or similar is the right solution for the whole sentence.

Where to put "157"?

Should not go as content of an embedded seg, because it is not a 
segment of the source content at all, but a property of it.

If it's an attribute, it should be on a <seg> surrounding the content 
that it is a property (normalization) of, namely "one hundred fifty 
seven".

None of the existing elements or attributes really fit (though many 
are syntactically possible).

TEI <num> element would be nice for this. Takes type and value.

We should specify a single format for the 'value' attribute: some XSD 
numeric type that covers floats and integers.
Types: card/ord/pct/frac -- plus x-
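
A small Python sketch of the normalization in question: turning the spelled-out number into a value and wrapping it in a TEI-style num with type and value. The function names are assumptions, the parser covers only simple English cardinals, and the element shape follows the discussion above rather than any settled schema.

```python
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def words_to_int(text):
    """Convert simple English cardinal words to an integer."""
    total = current = 0
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100
        elif word == "thousand":
            total += current * 1000
            current = 0
        else:
            raise ValueError("not a number word: %r" % word)
    return total + current


def num_element(text):
    """Wrap the source words in a num carrying the normalized value."""
    return '<num type="card" value="%d">%s</num>' % (words_to_int(text), text)
```

The source words stay as content and the 157 lives on the attribute, which matches the point that the value is a property of the text, not a segment of it.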

Should we also add measure?
Its sample types are weight, count, length, area, volume, currency.
We also need time durations (xsd has a duration type)
TEI timeRange doesn't really do duration

length, mass, charge, angle, solid angle, temperature, time,

meter, kg, sec, amp, K, mole, candela


Minimal set: length, time, currency, volume, count, area, mass.

reg is a pure numeric value.

add unit attribute: pick from somewhere.....


last issue: chars allowed in morph values. esp. hyphen, as in 
"n-nm-s" etc. Right now we have the same regex controlling these and 
osisRefs.... so you end up with the same distinction.

Could split off the definitions for at least lemma and morph, to: use 
prefix, reserve space as top-level delim, but allow hyphen etc.

Problem: annotateRef is union of osisRef, osisID, and osisGen (lemma/morph).

[[sjd: was that really a good idea??

Prob: the last case means you're annotating metadata....

Todd: no, want to refer to a word in the abstract

Steve: that should be pointing to a lexicon entry via its osisID

Todd: That's even narrower, if lemmas have to be osisIDs

[[sjd:

how about dividing the ambiguity of what the value of annotateRef is 
-- split into 3 attributes and simplify the regexes. much easier to 
explain, and to maintain in schema, and can avoid interference 
between osisRefs and lemmas and morphTags....

What additional things may we want to annotate?

    -- the content, start/end location, gi, and attrs of any elements 
at all (or content portions)

    --

(very lengthy discussion of syntax of annotateRef)

In the end: lemmas and morphs etc. are semantically osisIDs into 
other documents. It seems it would disturb an awful lot of 
already-complex syntax, or require some ugly mode switch/double 
prefix/something. In the end we ended up keeping the same reserved 
characters, and deciding to add to the manual very clear statements 
of the list, and recommendations of what to do with legacy 
identifiers that include those characters. The reserved characters 
are:

     (space) - : [ ] @ !

Period can be used but not doubled.
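
A Python sketch for screening legacy identifiers against that list. It checks only the reserved characters and the doubled period stated above, not the full osisRef regex, and the function name is an assumption.

```python
# Reserved characters as listed above: space - : [ ] @ !
RESERVED = set(" -:[]@!")


def reserved_problems(identifier):
    """List the reserved characters (and any doubled period) in an identifier."""
    problems = sorted(set(identifier) & RESERVED)
    if ".." in identifier:
        problems.append("..")
    return problems
```

An empty list means the identifier can be used as is; anything else needs the escaping or renaming that the manual is to recommend.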






Meeting in August?

At Calvin?

Prepare, submit, and review PSI sets

sjd and kel to prepare initial fsds, etc. for hebrew morphs




Priorities:

Finish manual

LAWG

Test sermons/commentaries/devotionals/etc

Versification system declarations and mapping

Liaison w/ Bible Forum audio/video stuff

Grant development for text prep

Items passed to list

extract list of schema-affecting issues and resolutions, release 
*only* for testing of conversion software, etc.

2.1 schema:

-- 

Steve DeRose -- http://www.derose.net
Chair, Bible Technologies Group -- http://www.bibletechnologies.net
Email: sderose@acm.org  or  steve@derose.net