[osis-core] Linguistic Annotation Module Design Document -- linguistic issues

Chris Little osis-core@bibletechnologieswg.org
Fri, 07 Nov 2003 15:50:44 -0600


Kirk,

To some degree, some of these issues will be answered along with the 
question from my previous reply regarding whether the LA module should 
be able to handle all languages (needing just a language declaration 
document).

The elements listed include just <w> and <morpheme>, with the only 
change to <w> being the inclusion of <morpheme>.  I would further 
suggest adding most of the attributes assigned to <morpheme> to <w>.  
Many existing texts only describe features down to the word level.  
If we encountered a word like "walked", we might still wish to add 
parsing info, such as <w tense="past">walked</w>, even if we lacked the 
data to identify features of specific morphemes.  All of CrossWire's 
Greek texts marked with morphological data are in this situation and 
could not use the LA module unless data like this were allowed at the 
word level.  Otherwise, we would be forced to put a <morpheme> element 
inside every <w> element purely for the purpose of hanging attributes, 
even though the contents of the <morpheme> element would not actually 
be morphemes.
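
To make the contrast concrete, here is a rough sketch of the two 
encodings for a text that carries only word-level parsing.  Attributes 
on <w> are what I am proposing, not something in the current draft:

```xml
<!-- Proposed: parsing attributes hang directly on <w> -->
<w tense="past">walked</w>

<!-- Workaround otherwise required: a <morpheme> wrapper whose content
     is a whole word, not actually a morpheme -->
<w><morpheme tense="past">walked</morpheme></w>
```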

If a <morpheme> is marked with a number attribute, there are two 
different ways I can think of that it could be interpreted.  It might 
seem that they should be obvious from context, but I still think a 
method of disambiguation would be valuable.  The number attribute could 
indicate either feature embodiment/assignment or agreement.  E.g. the 
sentence "He walks." would probably be marked as

<p><w><morpheme number="singular">He</morpheme></w>
<w><morpheme>walk</morpheme><morpheme number="singular">s</morpheme></w></p>

where number is an inherent feature of "He" but merely agreement on "s".

This seems to raise a number of issues.  Since this verb happens to be 
intransitive and English only has subject person-number agreement, it's 
obvious what 's' agrees with in number.  Plenty of languages would need 
a facility for distinguishing between subject and object agreement.  Is 
"subject agreement" a possible value of the "pos" attribute?  (I'm 
generally unclear about the function of "pos" on a morpheme, since this is 
a feature of words in every grammatical framework with which I am 
familiar, and most deny you the right to look back into a word, separate 
the affixes, and identify them with parts of speech.)
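
One hypothetical way to make the distinction explicit would be a 
separate attribute on the agreeing morpheme.  The attribute name 
"agreement" and its value "subject" below are purely my assumptions for 
illustration; nothing like them appears in the draft:

```xml
<p>
  <!-- number is an inherent feature of the pronoun -->
  <w><morpheme number="singular">He</morpheme></w>
  <!-- number on the suffix is agreement; the hypothetical "agreement"
       attribute says what it agrees with -->
  <w><morpheme>walk</morpheme><morpheme number="singular" agreement="subject">s</morpheme></w>
</p>
```

A language with object agreement could then use agreement="object" on 
the relevant affix instead of overloading "pos".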

Regarding non-linear affixation, I would suggest providing a facility 
like we have for quotation in the core schema:  allow a splitID on 
<morpheme> and allow recursive embedding of <morpheme>.  For example, in 
German, you've got singular Apfel 'apple', plural Äpfel.  Pluralization 
occurs by non-linear affixation, namely umlaut, identified graphically 
by the diaeresis.  I would encode this as (roughly)

<w><morpheme>A<morpheme number="plural">¨</morpheme>pfel</morpheme></w>

I don't have any idea how you would mark morphemes that are not 
graphically represented, such as the stress shift that derives the noun 
'produce from the verb pro'duce; those might just have to be assumed to 
be suppletive.
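
The splitID half of the suggestion could look something like the 
following sketch for a circumfix, using the German participle gespielt 
'played' (ge- + spiel + -t).  I am assuming here that pieces sharing the 
same splitID value are read as one discontinuous morpheme; the value 
"m1" and the tense value are made up for the example:

```xml
<!-- Line breaks inside <w> are for readability only -->
<w>
  <morpheme splitID="m1" tense="past">ge</morpheme>
  <morpheme>spiel</morpheme>
  <morpheme splitID="m1" tense="past">t</morpheme>
</w>
```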

There are a number of Hebrew-specific attributes, which seem to be all 
of those marked by a star.  I think (and I assume everyone would agree 
with me on this--and hope everyone can be convinced of this, if not) 
that a person doing linguistic annotation of a text should have the 
ability to use the terms that are standard to work in that language.  
E.g., if I'm working on German, I would want to be able to mark noun 
genders and in Hebrew I would want to be able to identify stem types.  
That said, I think the Hebrew linguistic vocabulary might be so 
distinctive as to deserve being removed to another module (one derived 
from the LA module).  Alternatively, maybe they could be indicated by a 
prefix like "heStem" instead of simply "stem", if Hebrew is deemed to be 
too central to OSIS LA to be removed like that.  (Side note: isn't there 
a histpael stem?  I seem to remember losing some points on a quiz for 
marking a verb as hitpael--stupid me.)
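
As a sketch of the two naming options (the element content is elided, 
and the exact attribute values are assumptions):

```xml
<!-- Current style: generic attribute name carrying a Hebrew-specific value -->
<w><morpheme stem="hitpael">...</morpheme></w>

<!-- Alternative: language-prefixed attribute name -->
<w><morpheme heStem="hitpael">...</morpheme></w>
```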

I would recommend that "kqtype" be removed from the LA module 
entirely, since it's not linguistic in nature.  We should probably add 
<seg type="ketiv"> and <seg type="qere"> to the next release of OSIS 
Core--or else a more permanent solution.

More generally....  Verbs typically can have (in addition to tense, 
which is already accounted for): aspect, voice, mood, & modality.  These 
all probably deserve attributes on the morpheme.  Case is also a notable 
omission from the attributes that would apply to nouns, and I would 
further suggest adding semanticRole or something equivalent.  I strongly 
recommend a gloss attribute as well on the morpheme.  If anyone uses the 
LA module to generate an interlinear, it will be necessary.  Gender, 
cross-linguistically, does not have a range of values that can be 
enumerated.  Masculine/feminine/neuter are good standard values for many 
languages, but, e.g. Dyirbal would use numerals 1-4 (men, women, edible 
plants, & other), Korean might use myriad values like "paper", "stick", 
"color", etc.
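
For illustration, "He walked." with the suggested additions might look 
roughly like this.  Only semanticRole and gloss are names I actually 
proposed above; the other attribute names (case, mood, voice) and all of 
the values are assumptions for the sake of the example:

```xml
<p>
  <w><morpheme case="nominative" semanticRole="agent" gloss="3SG.M">He</morpheme></w>
  <w><morpheme gloss="walk">walk</morpheme><morpheme tense="past" mood="indicative" voice="active" gloss="PAST">ed</morpheme></w>
</p>
```

An interlinear generator could then read the gloss values straight off 
each <morpheme> in document order.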

Inflectional morphology is pretty well handled, but most derivational 
morphology isn't, in the proposed system.  There's no means for 
signaling that a morpheme derives a causative, applicative, passive, 
antipassive, reflexive, reciprocal, etc. form of a verb.  Nor is there a 
means for indicating nominalizers, adjectivizers, etc.  (That is, unless 
this is the function of the pos attribute, or something else that I'm 
not noticing.)
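
A rough sketch of what a facility for derivational morphology might look 
like, using English "teacher" (teach + -er).  The attribute name 
"derivation" and its value are hypothetical, chosen only to show the 
kind of information that currently has no home:

```xml
<!-- Hypothetical "derivation" attribute marking -er as a nominalizer -->
<w><morpheme>teach</morpheme><morpheme derivation="nominalizer">er</morpheme></w>
```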

I think we should also adopt a set of values (like 'past', 'preterite', 
'noun', 'subjunctive', 'passive', etc.) defined by some outside 
linguistic authority.  I checked EAGLES for a list, but theirs is both 
very incomplete and frequently misclassifies attributes in a way that 
suggests to me that they shouldn't be trusted as an authority.  Perhaps 
the LSA or some international body has compiled a usable authority list.

That's all for now, at least.

--Chris


Kirk Lowery wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Friends,
>
> For your amusement -- but more especially for your expert comment -- I
> attach a first draft of a schema design document for OSIS linguistic
> annotation; more precisely, for morphologic annotation. We'll get to
> syntactic annotation after this. This is the concrete outcome of the
> intensive three days of face to face work Steve and I did last week.
>
> - --
> Kirk E. Lowery, Ph.D.
> Director, Westminster Hebrew Institute
> Adjunct Professor of Old Testament
> Westminster Theological Seminary, Philadelphia
>
> Theorie ist, wenn man alles weiss und nichts klappt.
> Praxis ist, wenn alles klappt und keiner weiss warum.
> Bei uns sind Theorie und Praxis vereint:
> nichts klappt und keiner weiss warum!
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.2.1 (MingW32)
> Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
>
> iD8DBQE/pmOSfUA6+Yl7duERArbmAKCPWUAGbMLRI8+PmycwjUTwGZHoYwCg0jkc
> O8WsRiTQ2MVUbRtuSOeNbkE=
> =jKEb
> -----END PGP SIGNATURE-----