[sword-devel] indexed search discrepancy
dmsmith at crosswire.org
Sun Aug 30 14:14:20 MST 2009
On Aug 30, 2009, at 4:15 PM, Matthew Talbert <ransom1982 at gmail.com> wrote:
>>> Just out of curiosity, what are the silly transformations?
>> See: http://www.gossamer-threads.com/lists/lucene/java-user/80838
>> Basically, the StandardAnalyzer has a tokenizer that recognizes
>> patterns to determine word boundaries. By and large, these patterns
>> (e-mail addresses, host names, ...) won't be found in the Bible. Maybe
>> in commentaries and gen books. But there is a cost of running an
>> analyzer that generally does nothing and occasionally does something.
>> The SimpleAnalyzer merely looks for word boundaries that are
>> appropriate for English. It is not appropriate for languages that have
>> different letters or word boundaries. There are a bunch of contributed
>> analyzers for other languages (e.g. Thai, Chinese) that are more
>> appropriate for them. In the upcoming Lucene 3.0 release there will be
>> analyzers for more languages, including Farsi. These could be ported
>> from Java to C++ if they are of value to SWORD.
> But the StandardAnalyzer is no more appropriate for non-English,
It is no more appropriate. But it may be less.
> So unless we have the non-English analyzers, then there is no
> value in using the StandardAnalyzer over the simple?
Even with the non-English analyzers there is no value in the
StandardAnalyzer over the Simple.
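To make the distinction concrete, here is a minimal sketch of what SimpleAnalyzer-style tokenization amounts to (illustrative code, not Lucene's actual implementation): a token is a maximal run of letters, lowercased. StandardAnalyzer layers extra pattern recognition (e-mail addresses, host names, acronyms) on top of this, which is the cost that rarely pays off on Bible text.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of SimpleAnalyzer-style tokenization (not Lucene's
// code): emit each maximal run of letters, lowercased. Anything that is
// not a letter is treated as a word boundary.
public class SimpleTokenize {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Hit a boundary: flush the token we were building.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // → [in, the, beginning, god, created]
        System.out.println(tokenize("In the beginning, God created..."));
    }
}
```

Note how this scheme is exactly what breaks down for scripts without inter-word spaces (Thai, Chinese), which is why those languages need contributed analyzers.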
> clucene is still
> trying to become compatible with Lucene 2 (I think it's largely done,
> but not released yet). If these analyzers are for Lucene 3.0
Most are part of 2.x.
> is it
> possible that it would take substantial work to port them to clucene
> which is still stuck in Lucene 1 compatibility?
I don't think the effort is much harder than doing an initial port to
the same level. A tokenizer merely takes an input stream and breaks it
up into tokens and returns a token each time next(...) is called. What
differs between the releases is how next is implemented. The algorithm
is the same. (BTW, I am a Lucene contributor wrt tokenizers, so my
point is not merely academic. ;)
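The contract described above can be sketched like this (hypothetical class names, not Lucene's or CLucene's actual API): a tokenizer wraps an input stream and hands back one token per call to next(), signalling end of input with null. Porting an analyzer between Lucene versions changes how next() is implemented internally, not this contract.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Sketch of the tokenizer contract discussed above (illustrative names):
// one token per call to next(), null when the input is exhausted.
abstract class TokenStream {
    abstract String next() throws IOException;
}

// A letter-run tokenizer in this style. (Lucene's SimpleAnalyzer is
// built from a similar letter tokenizer plus a lowercasing filter.)
class LetterTokenizer extends TokenStream {
    private final Reader input;

    LetterTokenizer(Reader input) { this.input = input; }

    @Override
    String next() throws IOException {
        StringBuilder token = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isLetter((char) c)) {
                token.append((char) c);
            } else if (token.length() > 0) {
                return token.toString(); // boundary hit: emit the token
            }
        }
        return token.length() > 0 ? token.toString() : null;
    }
}

public class TokenizerDemo {
    public static void main(String[] args) throws IOException {
        TokenStream ts = new LetterTokenizer(new StringReader("Fear not: for I am"));
        String t;
        while ((t = ts.next()) != null) System.out.println(t);
    }
}
```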
In His Service,