[sword-devel] Chinese lucene problem
dmsmith at crosswire.org
Sun Oct 7 16:21:38 MST 2012
SWORD uses an English-oriented analyzer (Lucene's StandardAnalyzer). It works well for Latin-1 languages and for languages that bear some passing similarity to English (e.g. spaces between words, phonetic spelling, ...), but it does not do well with others.
The Lucene project has a few Chinese analyzers. Basically they do bi-gram indexing: every adjacent pair of characters is indexed. So the string ABCD would produce 3 bi-grams: AB, BC, and CD. One of these analyzers is quite big, and it might not be prudent to deliver it as part of a non-Chinese front-end.
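To make the bi-gram idea concrete, here is a minimal standalone sketch (not the actual Lucene analyzer code) of how overlapping bi-grams are generated from a string:

```java
import java.util.ArrayList;
import java.util.List;

public class Bigrams {
    // Produce overlapping bi-grams: "ABCD" -> [AB, BC, CD].
    // For Chinese text, each "letter" here is a CJK character,
    // so searching matches any indexed two-character sequence.
    static List<String> bigrams(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            out.add(s.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("ABCD")); // [AB, BC, CD]
    }
}
```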
For JSword, we use the language code supplied in the conf to vector into the selection of the best analyzer. There are specialized analyzers for a dozen languages, each of which has peculiarities that the StandardAnalyzer does not address properly. E.g. Thai does not use spaces for word breaks.
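The selection step can be sketched roughly as follows. The class and analyzer names below are illustrative placeholders, not JSword's actual code; the point is just the lookup with a StandardAnalyzer fallback:

```java
import java.util.HashMap;
import java.util.Map;

public class AnalyzerChooser {
    // Hypothetical mapping from conf language code to analyzer name;
    // JSword's real table covers about a dozen languages.
    static final Map<String, String> ANALYZERS = new HashMap<>();
    static {
        ANALYZERS.put("zh", "ChineseAnalyzer"); // bi-gram indexing
        ANALYZERS.put("th", "ThaiAnalyzer");    // no spaces between words
        ANALYZERS.put("de", "GermanAnalyzer");  // stemming differences
    }

    // Fall back to the StandardAnalyzer when no specialized one exists.
    static String choose(String langCode) {
        return ANALYZERS.getOrDefault(langCode, "StandardAnalyzer");
    }

    public static void main(String[] args) {
        System.out.println(choose("zh")); // ChineseAnalyzer
        System.out.println(choose("en")); // StandardAnalyzer
    }
}
```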
On Oct 7, 2012, at 6:34 PM, Karl Kleinpaste <karl at kleinpaste.org> wrote:
> We've got a bug report in Xiphos saying that Chinese modules can't be
> searched well with CLucene indices.
> I know nothing at all about Chinese, and can't address this. Can anyone
> supply some info?