[sword-devel] Musings about the Cherokee NT module
chrislit at crosswire.org
Mon Jul 2 18:19:54 MST 2012
On 7/2/2012 5:47 AM, Greg Hellings wrote:
> Is there an available (and proper-name-tagged!) version of the Bible
> in a sister tongue to Cherokee that we could use as the basis for
> comparisons? "David" -> "dewi" seems a pretty distant comparison that
> is far more likely to yield issues than if we have a sister tongue
> where "dawi" or what have you is already marked as a proper name.
> Having such a related language would greatly enhance the accuracy of
> this portion of the work.
The naive, orthographic edit distance between 'david' and 'dewi' is 3-5.
(5 if substitutions cost 2, 3 if they cost 1.)
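As a rough sketch of where those two numbers come from, here is a plain
Levenshtein distance with an adjustable substitution cost. The function
name and parameters are my own for illustration, not anything in SWORD.

```python
def edit_distance(a, b, sub_cost=1, indel_cost=1):
    """Levenshtein distance with configurable substitution and indel costs."""
    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * indel_cost]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + indel_cost,                        # delete ca
                cur[j - 1] + indel_cost,                     # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost)  # match or substitute
            ))
        prev = cur
    return prev[-1]


print(edit_distance("david", "dewi"))              # substitutions cost 1
print(edit_distance("david", "dewi", sub_cost=2))  # substitutions cost 2
```

With substitutions at cost 1 the alignment is two substitutions (a/e,
v/w) plus one deletion (the final d); at cost 2 the same alignment costs 5.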
With metaphone (a modern soundex-type algorithm) that just assumes the
Cherokee is English, 'david' and 'dewi' become 'tft' and 'tw'
respectively, with an edit distance of 2-3.
Knowing some things about Cherokee helps us tune the algorithm for
Cherokee. For example, it has no final consonants, so maybe we shouldn't
penalize extra final consonants in English as much.
And we could also go straight from Hebrew/Greek instead of English,
since it appears the Cherokee is transliterating names from
Hebrew/Greek, not English. David from Hebrew would be transliterated
'dawid' and its metaphone-equivalent would be 'twt'. That's got an edit
distance of 1 from the Cherokee 'tw'. If we discount the cost of a
difference in final consonants, the edit distance would be even less
(0.5, for example).
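One simple way to express that discount, as a sketch only: compare the
keys normally, then also try dropping the last symbol of the source key
at a reduced cost, and keep the cheaper result. The names and the 0.5
default are assumptions for illustration; the keys are the ones quoted
above.

```python
def edit_distance(a, b, sub_cost=1, indel_cost=1):
    """Plain Levenshtein distance with configurable costs."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * indel_cost]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + indel_cost,
                cur[j - 1] + indel_cost,
                prev[j - 1] + (0 if ca == cb else sub_cost),
            ))
        prev = cur
    return prev[-1]


def distance_with_final_discount(source_key, cherokee_key, final_cost=0.5):
    """Charge only final_cost for dropping the source key's last symbol.

    Assumes the keys are consonant skeletons (metaphone-style), so the
    last symbol of source_key stands for a final consonant.
    """
    best = edit_distance(source_key, cherokee_key)
    if source_key:
        trimmed = edit_distance(source_key[:-1], cherokee_key) + final_cost
        best = min(best, trimmed)
    return best


print(distance_with_final_discount("twt", "tw"))  # Hebrew-derived key
print(distance_with_final_discount("tft", "tw"))  # English-derived key
```

The Hebrew-derived key comes out at 0.5 and the English-derived key at
1.5, so the discount helps both but the Hebrew route still wins.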
I imagine we could also examine the English and Hebraicize/Hellenize it,
as appropriate, to reconstruct a passable metaphone-equivalent of the
Hebrew/Greek from English. In the above, the 'v' in 'david' obviously
came from waw, so in Hebrew names, 'v' should probably become 'w' in
our modified metaphone algorithm.
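To make the 'v' to 'w' idea concrete, here is a toy keying function. It
is NOT real Metaphone: it is a hypothetical stand-in that handles only
the letters in this example (d, v, w and vowels), written so that it
reproduces the keys quoted above. The hebrew_name flag is likewise an
assumption about how such a switch might be exposed.

```python
VOWELS = set("aeiou")
# Toy code table covering only the letters needed for this example.
CODES = {"d": "t", "v": "f"}


def toy_key(word, hebrew_name=False):
    """Return a rough consonant skeleton of word (toy metaphone stand-in)."""
    word = word.lower()
    if hebrew_name:
        # Assume an English 'v' in a Hebrew name reflects waw.
        word = word.replace("v", "w")
    out = []
    for i, ch in enumerate(word):
        if ch in VOWELS:
            continue  # this toy drops every vowel, including initial ones
        next_is_vowel = i + 1 < len(word) and word[i + 1] in VOWELS
        if ch == "w" and not next_is_vowel:
            continue  # keep 'w' only when a vowel follows
        out.append(CODES.get(ch, ch))
    return "".join(out)


print(toy_key("david"))                    # English treatment
print(toy_key("david", hebrew_name=True))  # Hebraicized treatment
print(toy_key("dewi"))                     # Cherokee form
```

With the flag set, English 'david' yields the same key as the
transliterated Hebrew 'dawid', which is the point of the substitution.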
So, all told, we could probably tag names with very high accuracy using
names from a text in an unrelated language.