[sword-devel] seeking consensus on OSIS lemma best practice

Sat Oct 13 14:23:10 MST 2012

On 10/13/2012 6:12 AM, Daniel Owens wrote:
> Thanks, Chris. I had not thought of the latter solution, but that is
> what we need. This raises a fundamental question: how will front-ends
> find the right lexical entry?
>
> Currently, according to my understanding, a conf file may include
> Feature=HebrewDef. To distinguish Hebrew from Aramaic, I suggest the
> following value also be allowed: Feature=AramaicDef. Then front-ends
> will be able to find entries in the correct language.

HebrewDef indicates that a lexicon module is indexed by Strong's
numbers. Everything you've said so far suggests that you aren't using
Strong's numbers at all, so do not use Feature=HebrewDef. There should
also never be a Feature=AramaicDef, since Strong's numbering does not
distinguish Aramaic entries from Hebrew ones.

I think it would probably be helpful if you could enumerate the set of 
modules you propose to create:

a Bible (just one? more than one?)
a lexicon? separate Hebrew & Aramaic lexica?
a morphology database? separate Hebrew & Aramaic databases?

My guess is that you are advocating a Feature value that indicates "this 
lexicon module contains words in language X, indexed by lemma/word". I 
would absolutely be supportive of adding this, but we currently have 
nothing comparable in use. I would advocate 
(Greek|Hebrew|Aramaic|...)WordDef for the value.
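
To make the distinction concrete, here is a minimal Python sketch of how a
front-end might tell the two kinds of lexicon apart from a module's conf
text. The *WordDef values are only the proposal above and are not in use
anywhere today; the module name, paths, and function names are likewise
made up for illustration.

```python
import re

# Proposed (not yet existing) Feature values of the form <Language>WordDef.
WORD_DEF = re.compile(r"^(?P<language>[A-Z][a-z]+)WordDef$")

# Existing Feature values that mark a lexicon as keyed by Strong's numbers.
STRONGS_DEFS = ("GreekDef", "HebrewDef")


def features(conf_text):
    """Collect every Feature= value; the key may repeat within one conf."""
    found = []
    for line in conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "Feature":
            found.append(value.strip())
    return found


def lemma_languages(conf_text):
    """Languages for which the module claims lemma-keyed definitions."""
    languages = []
    for feature in features(conf_text):
        match = WORD_DEF.match(feature)
        if match:
            languages.append(match.group("language"))
    return languages


def is_strongs_keyed(conf_text):
    """True if the module declares a Strong's-number-keyed lexicon."""
    return any(f in STRONGS_DEFS for f in features(conf_text))


# Hypothetical conf for a lemma-keyed lexicon covering Hebrew and Aramaic.
EXAMPLE_CONF = """\
[ExampleLex]
DataPath=./modules/lexdict/rawld/examplelex/examplelex
ModDrv=RawLD
Feature=HebrewWordDef
Feature=AramaicWordDef
"""

print(lemma_languages(EXAMPLE_CONF))
print(is_strongs_keyed(EXAMPLE_CONF))
```

A module keyed by lemma would list one *WordDef value per language it
covers and would carry no HebrewDef/GreekDef at all, so the two lookups
never get confused.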

> But lemmatization can vary somewhat in the details within a language.
> How could we include mappings between lemmatizations? That way we could
> map between lemmatizations so a text using Strong's numbers could look
> up words in a lexicon keyed to Greek, Hebrew or Aramaic and vice versa.
> Perhaps a simple mapping format could be the following:
>
> The file StrongsGreek2AbbottSmith.map could contain:
> G1=α
> G2=Ἀαρών
> G3=Ἀβαδδών
> etc.
>
> Frontends could use these mappings to find the correct lexical entry. So
> a lookup from KJV could then find the relevant entry in AbbottSmith. And
> with a similar mapping MorphGNT2StrongsGreek.map a lookup from MorphGNT
> could find the correct entry in Strongs, if that is the default Greek
> Lexicon for the front-end.
>
> I use Greek because I have the data ready at hand, but this method would
> be even more important for Hebrew. I was testing with BibleTime and
> found that only some of the lemmata in WHM would find their way to the
> correct BDB entry. This is because their lemmatizations are different.
> Providing for a mapping would allow us to resolve those conflicts for
> the user. Also, the OSMHB module could find entries in BDB keyed to
> Hebrew, and the WHM could find entries in BDB or Strongs. I expect this
> mapping would need to happen at the engine level.
>
> Is that a reasonable solution? Or does someone have a better idea?

I believe that the mapping to/from Strong's numbers is not one-to-one but
many-to-many. We currently allow lookups based on lemmata by keying
lexica to lemmata. A lexicon can have multiple keys pointing to a single
entry.
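
As a rough sketch of what that implies, the Python below reads key=value
lines like the ones you propose but allows a key to repeat, so each side
can map to several values on the other. It also models a lexicon in which
several keys share one entry. The class, the function names, and the
sample keys (S1, lemma-a, ...) are placeholders I made up, not anything
in the engine and not real Strong's data.

```python
from collections import defaultdict


def load_mapping(lines):
    """Parse 'source=target' lines into forward and reverse many-to-many maps."""
    forward = defaultdict(set)
    reverse = defaultdict(set)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, sep, target = line.partition("=")
        if not sep:
            continue  # not a key=value line
        forward[source.strip()].add(target.strip())
        reverse[target.strip()].add(source.strip())
    return forward, reverse


class Lexicon:
    """Toy lexicon in which several keys may point at the same entry."""

    def __init__(self):
        self._entries = []  # entry texts, stored once each
        self._index = {}    # key -> position in self._entries

    def add(self, keys, text):
        self._entries.append(text)
        position = len(self._entries) - 1
        for key in keys:
            self._index[key] = position

    def lookup(self, key):
        position = self._index.get(key)
        return None if position is None else self._entries[position]


def resolve(lexicon, forward, source_key):
    """Return the distinct entries reachable from source_key via the mapping."""
    entries = []
    for target in sorted(forward.get(source_key, ())):
        entry = lexicon.lookup(target)
        if entry is not None and entry not in entries:
            entries.append(entry)
    return entries


# Placeholder data: S1 maps to two lemmata, and lemma-b is shared with S2.
forward, reverse = load_mapping(["S1=lemma-a", "S1=lemma-b", "S2=lemma-b"])
lexicon = Lexicon()
lexicon.add(["lemma-a", "lemma-b"], "entry one")
print(resolve(lexicon, forward, "S1"))
```

Because both directions are sets, the same file serves lookups from
Strong's numbers to lemmata and back, which a strict one-to-one table
could not do.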

Ultimately, it would be very nice to write a stemmer for each of the
relevant languages, index lexica by stem (or facilitate searches by
stem), and thus do away with some of the need to pre-lemmatize texts. I
don't know whether stemming algorithms exist for Greek & Hebrew, or how
reliable they would be if they do, but the area is worth some research.
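
Just to show the shape of the idea, here is a small Python sketch of a
stem-keyed index. The stemmer is passed in as a function; the one below
is a deliberately naive English suffix stripper that I am using only as
a stand-in, since I have no Greek or Hebrew stemmer to offer. All names
are made up for the example.

```python
from collections import defaultdict


def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Placeholder stemmer: strips one English suffix.

    This is NOT a Greek or Hebrew stemmer; a real one would replace it.
    """
    word = word.lower()
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_index(headwords, stem=naive_stem):
    """Map each stem to the set of lexicon headwords that share it."""
    index = defaultdict(set)
    for headword in headwords:
        index[stem(headword)].add(headword)
    return index


def find_headwords(index, surface_form, stem=naive_stem):
    """Look up an inflected form from a text without pre-lemmatizing it."""
    return sorted(index.get(stem(surface_form), ()))


index = build_stem_index(["walk", "talk"])
print(find_headwords(index, "walked"))
print(find_headwords(index, "talks"))
```

The text itself then needs no lemma markup for these lookups; how well
this works depends entirely on the quality of the stemmer plugged in.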

--Chris