[sword-devel] Musings about the Cherokee NT module
dmsmith at crosswire.org
Mon Jul 2 19:38:45 MST 2012
On Jul 2, 2012, at 9:19 PM, Chris Little wrote:
> On 7/2/2012 5:47 AM, Greg Hellings wrote:
>> Is there an available (and proper-name-tagged!) version of the Bible
>> in a sister tongue to Cherokee that we could use as the basis for
>> comparisons? "David" -> "dewi" seems a pretty distant comparison that
>> is far more likely to yield issues than if we have a sister tongue
>> where "dawi" or what have you is already marked as a proper name.
>> Having such a related language would greatly enhance the accuracy of
>> this portion of the work.
> The naive, orthographic edit distance between 'david' and 'dewi' is 3-5. (5 if substitutions cost 2, 3 if they cost 1.)
> With metaphone (a modern soundex-type algorithm) that just assumes the Cherokee is English, 'david' and 'dewi' become 'tft' and 'tw' respectively, with an edit distance of 2-3.
> Knowing some things about Cherokee helps us tune the algorithm for Cherokee. For example, it has no final consonants, so maybe we shouldn't penalize extra final consonants in English as much.
> And we could also go straight from Hebrew/Greek instead of English, since it appears the Cherokee is transliterating names from Hebrew/Greek, not English. David from Hebrew would be transliterated 'dawid' and its metaphone-equivalent would be 'twt'. That's got an edit distance of 1 from the Cherokee 'tw'. If we discount the cost of a difference in final consonants, the edit distance would be even less (0.5, for example).
> I imagine we could also examine the English and Hebraicize/Hellenize it, as appropriate, to reconstruct a passable metaphone-equivalent of the Hebrew/Greek from English. In the above, the 'v' in 'david' obviously came from waw, so in Hebrew names, 'v' should probably become 'w' in our modified metaphone algorithm.
> So, all told, we could probably tag names with very high accuracy using names from a text in an unrelated language.
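The distances quoted above can be checked with a small weighted edit distance. This is only a sketch: the function name and the `sub_cost` and `final_del_cost` parameters are made up for illustration, insertions and other deletions are assumed to cost 1, and the metaphone codes ('tft', 'tw', 'twt') are copied from the message rather than computed.

```python
def edit_distance(src, tgt, sub_cost=1.0, final_del_cost=1.0):
    """Cost of turning src into tgt with unit insertions and deletions.

    sub_cost       -- cost of substituting one character for another
    final_del_cost -- cost of dropping the last character of src when the
                      rest of src already accounts for all of tgt
                      (hypothetical knob for the "no final consonants" idea)
    """
    m, n = len(src), len(tgt)
    # dist[i][j] = cheapest way to turn src[:i] into tgt[:j]
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + 1.0
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + 1.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = src[i - 1] == tgt[j - 1]
            # Only the very last source character gets the discounted deletion.
            del_cost = final_del_cost if (i == m and j == n) else 1.0
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                         # delete from src
                dist[i][j - 1] + 1.0,                              # insert into src
                dist[i - 1][j - 1] + (0.0 if same else sub_cost),  # match or substitute
            )
    return dist[m][n]


print(edit_distance("david", "dewi"))                    # substitutions cost 1
print(edit_distance("david", "dewi", sub_cost=2))        # substitutions cost 2
print(edit_distance("tft", "tw"))                        # metaphone codes, English source
print(edit_distance("twt", "tw"))                        # metaphone codes, Hebrew source
print(edit_distance("twt", "tw", final_del_cost=0.5))    # discounted final consonant
```

The discount here applies to any final character, not just consonants; since the metaphone codes consist of consonants only, that simplification does not matter for these inputs.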
This works if the names vary very little from one verse to another. That is, if "dewi" is the only spelling of David, then one can probably take the set of verses that have David in them and look for the words common to the Cherokee in those same verses. That would also narrow the set of words that need to be considered. It might narrow it to one word.
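A rough sketch of that verse-intersection idea follows. The function name, the dict-of-verses layout, and the sample data are all invented for illustration (the "target" words other than 'dewi' are placeholders, not real Cherokee), and it assumes whitespace-separated words with punctuation already stripped.

```python
def common_words(source_verses, target_verses, name):
    """Return the target-language words found in every verse whose
    source-language text contains `name`.

    Both arguments map a verse reference to its text. This is a toy
    layout, not the actual module format.
    """
    name = name.lower()
    common = None
    for ref, text in source_verses.items():
        if name not in text.lower().split():
            continue  # verse does not mention the name
        words = set(target_verses.get(ref, "").lower().split())
        # Keep only words seen in every matching verse so far.
        common = words if common is None else common & words
    return common or set()


# Placeholder data: three "verses", two of which mention David.
english = {
    "v1": "David went up",
    "v2": "and David said",
    "v3": "the king spoke",
}
target = {
    "v1": "aaa dewi bbb",
    "v2": "ccc dewi ddd",
    "v3": "aaa eee",
}

print(common_words(english, target, "David"))
```

With real text, very frequent function words would also survive the intersection, so the surviving candidates would likely still need to be ranked, for instance by the phonetic distance discussed earlier in the thread.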