[sword-devel] indexed search discrepancy

Sun Aug 30 13:15:08 MST 2009

>> Just out of curiosity, what are the silly transformations?
>
> See: http://www.gossamer-threads.com/lists/lucene/java-user/80838
>
> Basically, the StandardAnalyzer has a tokenizer that recognizes complex
> patterns to determine word boundaries. By and large, these transformations
> (e-mail addresses, host names, ...) won't be found in the Bible. Maybe in
> commentaries and gen books. But there is a cost of running an expensive
> analyzer that generally does nothing and occasionally does something
> unexpected.
>
> The SimpleAnalyzer merely looks for word boundaries that are appropriate for
> English. It is not appropriate for languages that have different punctuation
> or word boundaries. There are a bunch of contributed analyzers for different
> languages (e.g. Thai, Chinese) that are more appropriate for them. In the
> upcoming Lucene 3.0 release there will be analyzers for more languages,
> including Farsi. These could be ported from Java to C++ if they are valuable
> to SWORD.

But the StandardAnalyzer is no more appropriate for non-English text,
correct? So unless we have the non-English analyzers, there is no
value in using the StandardAnalyzer over the SimpleAnalyzer? CLucene is
still trying to become compatible with Lucene 2 (I think that work is
largely done, but not released yet). If these analyzers are for Lucene
3.0, is it possible that it would take substantial work to port them to
CLucene, which is still stuck at Lucene 1 compatibility?
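
To make the difference concrete, here is a rough sketch in Python (not
CLucene or Lucene code). It assumes the SimpleAnalyzer does nothing more
than split at non-letters and lowercase, and it uses a made-up e-mail
pattern as a stand-in for the kind of "complex pattern" the
StandardAnalyzer recognizes. The function names simple_tokens and
pattern_tokens are hypothetical, for illustration only.

```python
import re

# Runs of Unicode letters only (word characters minus digits and underscore).
LETTERS = r"[^\W\d_]+"

# Hypothetical, simplified e-mail pattern; the real StandardAnalyzer grammar
# is more elaborate and also covers host names and other token types.
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"


def simple_tokens(text):
    """Assumed SimpleAnalyzer-like behavior: split at non-letters, lowercase."""
    return [t.lower() for t in re.findall(LETTERS, text)]


def pattern_tokens(text):
    """Stand-in for a pattern-aware tokenizer: keep e-mail addresses whole."""
    return [t.lower() for t in re.findall(EMAIL + "|" + LETTERS, text)]


verse = "In the beginning, God created"
note = "Write to user@example.org"

print(simple_tokens(verse))
print(pattern_tokens(verse))
print(simple_tokens(note))
print(pattern_tokens(note))
```

On the verse both functions give the same tokens, which is the "generally
does nothing" case. On the note, the pattern-aware version keeps the
address as one token while the simple one splits it into three words.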

Matthew