[sword-devel] Chinese PinYin, OSIS, SWORD and front-ends

DM Smith dmsmith at crosswire.org
Tue Oct 19 15:15:12 MST 2010

On Oct 19, 2010, at 5:20 PM, Chris Little wrote:

> On 10/19/2010 1:54 PM, Matthew Talbert wrote:
>> On Tue, Oct 19, 2010 at 4:19 AM, David Haslam<d.haslam at ukonline.co.uk>  wrote:
>>> Something to ponder for the future then, maybe?
>>> See http://crosswire.org/wiki/Talk:Transliteration
>>> Thanks, Chris, for useful comments there.
>> As Chris says there, it would require indexing both versions of the
>> module, something I don't believe is currently possible. What would be
>> cool (imo) is to have the transliterated text available in a different
>> field, much as lemma is done now. Then a search for trans:something
>> would access the transliterated data. Of course, it would be nice to
>> provide this transparently to the end user.
> I'm really about as ignorant of (C)Lucene as a person can be, so someone please correct me if I'm wrong. I believe our indexing just indexes at the record level (verses or dictionary entries). So, upon creation of the index, you could just concatenate the text and the transliterated text and do indexing for that. Unless you need to support exact string matches across record boundaries, the concatenation shouldn't affect results.

This is how it currently works. A better way is to put the transliteration into its own field, so that it cannot affect searches that span records (which is not really a possibility now anyway). This is what we do for Strong's numbers.
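
To make the separate-field idea concrete, here is a minimal sketch in Python. It is a toy in-memory index, not the (C)Lucene API; the class name FieldedIndex, the record key "rec1" and the sample terms are made up for illustration, and the "trans" field name is borrowed from Matthew's trans:something suggestion.

```python
from collections import defaultdict


class FieldedIndex:
    """Toy inverted index: postings are keyed by (field, term), one entry per record."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, key, **fields):
        # Each field is tokenized on whitespace and indexed under its own name,
        # so a term in "trans" can never match a search against "text".
        for field, value in fields.items():
            for term in value.split():
                self._postings[(field, term)].add(key)

    def search(self, query, default_field="text"):
        # "trans:oikos" searches the trans field; a bare term searches the default field.
        field, sep, term = query.partition(":")
        if not sep:
            field, term = default_field, query
        return sorted(self._postings.get((field, term), set()))


index = FieldedIndex()
index.add("rec1", text="οικοσ", trans="oikos")  # hypothetical record
print(index.search("trans:oikos"))
print(index.search("oikos"))
```

Because the transliteration lives in its own field, a bare search for oikos finds nothing in the text field, while trans:oikos finds the record.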

There are problems with this approach. They fundamentally boil down to two things: the index and the search request have to be normalized in the same way, and the user has expectations about what a search will match.

Take the Greek word οἶκος, for example. One might want to search without the accents, οικος; perhaps without the final sigma form, οικοσ; possibly in uppercase or mixed case, ΟΙΚΟΣ; or maybe transliterated, oikos. In Lucene contrib, there is a Greek analyzer that handles all but transliteration. It might also be in CLucene.
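
As a rough illustration of that normalization, the sketch below folds accents, case and the final sigma using only the Python standard library. It is not the Lucene Greek analyzer, just an assumption about the kind of folding such an analyzer performs; the function name normalize_greek is hypothetical. The same function would have to be applied both when building the index and when processing the search request.

```python
import unicodedata


def normalize_greek(text: str) -> str:
    """Fold accents, breathings, case and final sigma to one canonical form."""
    # Decompose so accents and breathings become separate combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn).
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Lowercase, then map final sigma to medial sigma.
    return stripped.lower().replace("ς", "σ")


for form in ("οἶκος", "οικος", "οικοσ", "ΟΙΚΟΣ"):
    print(form, "->", normalize_greek(form))
```

All four spellings should come out as the same string, which is what lets a query in any of these forms match the indexed term.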

My guess is that we can add a transliterator that would convert a query such as oikos to οικοσ. That way, the transliteration would not need to be stored.
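
A minimal sketch of such a query-side transliterator follows. The mapping table is a simplified assumption made up for illustration, not an established romanization scheme, and it ignores real ambiguities (for instance, "e" could stand for ε or η, and "o" for ο or ω). The names LATIN_TO_GREEK, DIGRAPHS and latin_to_greek are hypothetical.

```python
# Two-letter sequences are tried first so that "th" becomes θ rather than τ + h.
DIGRAPHS = {"th": "θ", "ph": "φ", "ch": "χ", "ps": "ψ"}

# Simplified, lossy single-letter mapping (an assumption for illustration).
LATIN_TO_GREEK = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε", "z": "ζ",
    "i": "ι", "k": "κ", "l": "λ", "m": "μ", "n": "ν", "x": "ξ",
    "o": "ο", "p": "π", "r": "ρ", "s": "σ", "t": "τ", "u": "υ",
}


def latin_to_greek(query: str) -> str:
    """Convert a Latin-script query to unaccented Greek with medial sigma only."""
    text = query.lower()
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            # Characters without a mapping are passed through unchanged.
            out.append(LATIN_TO_GREEK.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(latin_to_greek("oikos"))
```

The output uses the same unaccented, lowercase, medial-sigma form as the normalized index terms, so only the query needs converting and nothing extra has to be stored.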

I don't know enough about Chinese and PinYin to know if this would work.

In Him,

> Something I mention on the wiki, that I think you're also advocating, is doing transliteration of the text on a word-by-word basis and placing the result in the <w xlit="..."> attribute (all via a filter). That partly depends on the sourcetype being OSIS (though we could do it to plaintext too, and change its sourcetype at runtime). We could certainly run such a filter process prior to indexing, which would mean that the transliterated text could be searched, even if transliteration is turned off in the current view.
> --Chris
> _______________________________________________
> sword-devel mailing list: sword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page
