[sword-devel] indexed search discrepancy
dmsmith at crosswire.org
Sat Aug 29 19:26:36 MST 2009
FYI: Issue 2, removing stopwords, will break backward compatibility
with existing indexes. The existing indexes will not contain the
stopwords. New indexes will. This can be very confusing to users.
If it is OK to break backward compatibility, I suggest changing from
StandardAnalyzer to SimpleAnalyzer. It does not have stopwords to
begin with and will index the text without the silly transformations
that the StandardAnalyzer does.
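To make the compatibility problem concrete, here is a toy sketch in Python. It is not SWORD or clucene code: the analyzer, the stopword list and the index structure are simplified stand-ins I made up for illustration. It only shows what happens when an index built with stopword removal is searched by code that no longer removes them.

```python
# Toy model only: these helpers are hypothetical stand-ins, not Lucene classes.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}  # illustrative subset


def analyze(text, stopwords=frozenset()):
    """Lowercase, split on whitespace, and drop any stopwords."""
    return [t for t in text.lower().split() if t not in stopwords]


def build_index(docs, stopwords=frozenset()):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text, stopwords):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, stopwords=frozenset()):
    """Return the ids of documents that contain every analyzed query term."""
    terms = analyze(query, stopwords)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


docs = {"Gen.1.1": "In the beginning God created the heaven and the earth"}
old_index = build_index(docs, STOPWORDS)  # built before the change
new_index = build_index(docs)             # built after the change
```

With the new search code (no stopword removal), `search(old_index, "the beginning")` finds nothing because "the" was never stored, while the same query against `new_index` finds the verse. The user sees different results for the same module depending on when the index was built.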
The segfault is surprising to me. I suggest checking with the clucene
folks to see why it is happening. I really doubt it is a bug in
clucene; more likely it is in SWORD's use of it.
Adding fields should probably be accompanied by versioning the
index. What the Java Lucene folks are doing for version
3.0 is to store with the index a manifest of sorts that describes what
was used to build the index.
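As a rough sketch of what such a manifest could look like for SWORD, here is a small Python example. Everything in it is an assumption for illustration: the file name, the keys, the version number and the field names are invented, and this is not the format Lucene itself uses.

```python
import json
import os

# All names here (file name, keys, version number, fields) are hypothetical.
MANIFEST_NAME = "index-manifest.json"
CURRENT_MANIFEST = {
    "format_version": 2,
    "analyzer": "SimpleAnalyzer",
    "fields": ["content", "footnote", "header", "morph", "strongs"],
}


def write_manifest(index_dir, manifest=CURRENT_MANIFEST):
    """Record next to the index how it was built."""
    path = os.path.join(index_dir, MANIFEST_NAME)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)


def index_is_current(index_dir, expected=CURRENT_MANIFEST):
    """Return True only if a manifest exists and matches what we expect.

    A missing or unreadable manifest is treated as out of date, which
    covers every index created before manifests were introduced.
    """
    path = os.path.join(index_dir, MANIFEST_NAME)
    try:
        with open(path, encoding="utf-8") as f:
            found = json.load(f)
    except (OSError, ValueError):
        return False
    return found == expected
```

A frontend could call `index_is_current()` before searching and offer to rebuild the index when it returns False.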
On Aug 29, 2009, at 4:25 PM, Matthew Talbert wrote:
> I'm attaching a patch to fix several issues with indexed search.
> Issue 1: large text fields weren't getting indexed due to a low
> Resolution: change MAX_CONV_SIZE to 1024 * 1024, and add call to
> writer to boost its maximum field size
> Issue 2: search causes segfault when searching for stop words
> Resolution: set analyzer stop words to NULL for both index
> creation and search. Possibly this would only have to be set for
> search, and left on to lower the index size.
> Issue 3: index causes segfault *after indexing* when module location
> isn't writable.
> Resolution: check the return value of
> FileMgr::createParent(target + "/dummy"); if return value is -1, abort
> In addition, this patch adds fields for footnotes, morphology, and
> headers. I *really* would like to see this added to the default
> indexing. The reason is that with indexed search it is possible to
> combine fields in one search, something that SWORD attribute search
> doesn't allow (AFAIK). And indexed search is much faster, of course.
> My patch only covers one of the three spots this would apparently need
> to be added. I didn't understand why there was so much duplicated
> code, nor was I entirely comfortable with the code I had written, so I
> didn't expand it to cover all cases. It appears that the code for
> adding fields like strongs is the same in 3 different spots. Surely
> this could be condensed somehow?
> I really would like to see the first 3 issues fixed immediately (i.e.,
> before next release). Issue 1 makes most genbook indexed search
> pointless, while Issues 2 and 3 have both been reported as issues
> against Xiphos. Of course, we can't control the segfault in either
> case. As far as the extra fields, that will need some extra work, but
> I feel it's really important as well. At some point, I am going to
> redo the search functionality in Xiphos, and my plan is to implement
> indexing myself if these fields aren't in SWORD by then.
> I have been meaning to address these issues for some time, but hadn't
> gotten around to it yet. The bug report we had forced the issue. While
> we're at it, I'd like to bring up two more issues.
> 1. If the module location isn't writable, there isn't a way for the
> user to create an index. I would like to see indexes created somewhere
> else in this case, e.g. ~/.sword/indexes. I believe BT does something
> like this already.
> 2. We currently have no way of notifying the user if the indexes are
> no longer valid, or if they should be updated. I would like to see a
> versioning scheme for indexes. For example, with the changes here, and
> the changes for Hebrew search, all Hebrew indexes previously created
> are now useless. How do we tell the user that he needs to re-create
> the index? Along the same lines, all genbook indexes, and many
> commentary indexes are incorrect. With the next release of SWORD,
> hopefully with this issue resolved, it would be nice to be able to
> notify the user that the indexes are now out-of-date or incorrect and
> need to be rebuilt.
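For the first of these two points, the fallback could look roughly like the Python sketch below. The function name, the "lucene" subdirectory and the per-module layout under ~/.sword/indexes are assumptions for illustration, not what SWORD or BT actually do.

```python
import os


def choose_index_dir(module_dir, module_name, home=None):
    """Pick where to create the index for a module (hypothetical helper).

    Prefer a subdirectory of the module location; if that location is
    missing or not writable, fall back to a per-user directory.
    """
    home = home or os.path.expanduser("~")
    if os.path.isdir(module_dir) and os.access(module_dir, os.W_OK):
        return os.path.join(module_dir, "lucene")
    return os.path.join(home, ".sword", "indexes", module_name)
```

The search code would need the same lookup order so that it finds an index wherever the indexer put it.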
> Finally, I would like to point out a great tool for examining
> lucene/clucene indexes. You can get it here: