[sword-devel] indexed search discrepancy

Matthew Talbert ransom1982 at gmail.com
Sat Aug 29 19:42:17 MST 2009


On Sat, Aug 29, 2009 at 10:26 PM, DM Smith<dmsmith at crosswire.org> wrote:
> FYI: Issue 2, removing stopwords, will break backward compatibility with
> existing indexes. The existing indexes will not contain the stopwords. New
> indexes will. This can be very confusing to users.

Two things. First, if we keep them out of the index and only prevent
searching for them, it wouldn't break compatibility. Second, no one
was able to search for them before without a segfault, so the only
change anyone will notice is that it doesn't crash anymore. (The stop
word issue will be far less noticeable to users than the proposed
size changes. In some cases, 30% or less of certain text segments was
getting indexed, so this will make a huge difference in the number of
hits.)

> If backward compatibility is ok to be broken, I suggest changing from
> StandardAnalyzer to SimpleAnalyzer. It does not have stopwords to begin with
> and will index the text without the silly transformations that the
> StandardAnalyzer does.

Just out of curiosity, what are the silly transformations?

> The segfault is surprising to me. I suggest checking with the clucene folks
> to see why it is happening. I really doubt it is a bug in clucene; more
> likely it is SWORD's use of it.

I think perhaps we're supposed to strip out the stop words before
querying clucene. It's easier just to set the stop words to NULL in
the first place. Note that, afaik, clucene's stop words are English
only (lucene has analyzers for other languages with different stop
words). This issue affects crosswire.org/study as well.
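
For illustration only, stripping stop words from a query before it
reaches the search library might look roughly like the sketch below.
It is plain Python rather than clucene code; the word list and the
function name are made up for the example and are not what clucene
or SWORD actually use.

```python
# Illustrative English stop words; an assumption, not clucene's real list.
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def strip_stop_words(query, stop_words=STOP_WORDS):
    """Return the query with stop words removed, keeping term order."""
    # Split on whitespace and compare case-insensitively.
    kept = [term for term in query.split() if term.lower() not in stop_words]
    return " ".join(kept)


if __name__ == "__main__":
    print(strip_stop_words("In the beginning"))
```

A query made up entirely of stop words comes back empty, so the
caller would have to handle that case instead of passing it on.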

> Adding additional fields probably should be accompanied by adding versioning
> to the index. What the Java Lucene folks are doing for version 3.0 is to store
> with the index a manifest of sorts that describes what was used to build the
> index.

I agree with versioning the index. I would increment the version every
time something changed that would affect the indexing, like the change
for Hebrew, or the proposed field size change. Both of these changes
break backward compatibility.
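
As a rough sketch of what a versioned index could look like (plain
Python, not SWORD or clucene code): a small manifest file sits next
to the index, and the index counts as stale whenever the stored
version differs from the one the current code expects. The file
name, the field names, and the version number are all hypothetical.

```python
import json
import os

# Hypothetical: bump whenever a change affects how text gets indexed.
INDEX_FORMAT_VERSION = 2
# Hypothetical manifest file stored alongside the index files.
MANIFEST_NAME = "manifest.json"


def write_manifest(index_dir, analyzer):
    """Record what was used to build the index in index_dir."""
    manifest = {"format_version": INDEX_FORMAT_VERSION, "analyzer": analyzer}
    path = os.path.join(index_dir, MANIFEST_NAME)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle)


def index_is_current(index_dir):
    """Return True only if a manifest exists and its version matches."""
    path = os.path.join(index_dir, MANIFEST_NAME)
    try:
        with open(path, encoding="utf-8") as handle:
            manifest = json.load(handle)
    except (OSError, ValueError):
        # Missing or unreadable manifest: treat as an old, unversioned index.
        return False
    if not isinstance(manifest, dict):
        return False
    return manifest.get("format_version") == INDEX_FORMAT_VERSION
```

An index built before versioning existed has no manifest, so it
would be reported as stale and could be rebuilt.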

Matthew


