[sword-devel] Search in any language (was: Search in Chinese modules)
eekaikko at mail.student.oulu.fi
Thu Feb 11 05:46:32 MST 2010
On Wed, 10 Feb 2010, DM Smith wrote:
> Chinese needs a special analyzer. In Java Lucene there are 3 choices.
> Two of them do some kind of bigram search. Basically it takes every
> two chars and indexes them. So ABCD is indexed as AB BC CD. The same
> analyzer would be used to prepare the search request.
> From what I gather, spaces are not the appropriate "word" boundary.
> In JSword we use the module's lang to pick an appropriate analyzer.
> When we added it we didn't worry about backward compatibility. We
> considered it as a bug fix. No one complained about having to rebuild
> indexes. We did get thanks, though.
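To make the bigram scheme described above concrete, here is a minimal sketch in plain Python. It is not Lucene's or JSword's code; the function names (bigrams, build_index, search) are hypothetical, and skipping whitespace is a simplifying assumption. It only illustrates that "ABCD" becomes AB BC CD and that the query is prepared the same way as the indexed text.

```python
def bigrams(text):
    """Split text into overlapping two-character tokens (ABCD -> AB BC CD).

    Whitespace is skipped (a simplifying assumption); a lone character
    is returned as a single one-character token.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def build_index(docs):
    """Map each bigram to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in bigrams(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every bigram of the query.

    The query goes through the same bigrams() step as the indexed text.
    """
    tokens = bigrams(query)
    if not tokens:
        return set()
    result = None
    for token in tokens:
        postings = index.get(token, set())
        result = set(postings) if result is None else result & postings
    return result
```

With two documents "ABCD" and "BCDE", a search for "BCD" matches both, while "ABC" matches only the first. Note that this checks only that all bigrams are present, not that they are adjacent, so it can over-match compared with a real phrase query.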
JSword has the luxury of a better Lucene implementation, but I think it's
still inadequate. I suppose there's no analyzer for e.g. Finnish, which
is an extremely difficult language to parse (much like Koine Greek, but
maybe even more difficult: two forms of one word may have no letters in
common at all, and there are no strict rules). Once, or actually twice,
I asked what the proper way would be to tag an OSIS document with
native-language lemmas (not Greek or Hebrew lemmas) for each word, but
I never got an answer. That would solve the problem, or at least it
would make it possible to index the lemma fields for the indexed search.
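As a rough illustration of what indexing lemma fields could look like, here is a small Python sketch. Everything in it is an assumption: the (surface form, lemma) pairs stand in for whatever lemma tagging a module might carry, the verse ids are made up, and the function names are hypothetical. It does not show any actual OSIS markup or SWORD/JSword API.

```python
from collections import defaultdict

# Hypothetical pre-tagged text: each word is (surface form, native-language lemma).
# "veden" and "vettä" are inflected forms of Finnish "vesi" (water);
# "maan" is an inflected form of "maa" (earth, land).
VERSES = {
    "verse-1": [("veden", "vesi")],
    "verse-2": [("vettä", "vesi")],
    "verse-3": [("maan", "maa")],
}


def build_lemma_index(verses):
    """Index verses by lemma instead of by surface form."""
    index = defaultdict(set)
    for verse_id, words in verses.items():
        for _surface, lemma in words:
            index[lemma.lower()].add(verse_id)
    return index


def search_lemma(index, lemma):
    """Return a sorted list of verse ids whose words carry the given lemma."""
    return sorted(index.get(lemma.lower(), ()))
```

Searching for the lemma "vesi" then finds both verse-1 and verse-2, even though neither contains the string "vesi" itself. That is the point of tagging: no analyzer has to guess the base form from the inflected one.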
Eeli Kaikkonen (Mr.), Oulu, Finland
e-mail: eekaikko at mailx.studentx.oulux.fix (with no x)