[sword-devel] Searching and Lucene thoughts
will at thimbleby.net
Tue Mar 1 15:42:04 MST 2005
I apologise for my ramblings, but here are some searching thoughts that
I've collected as I implemented Lucene searching in MacSword:
It is more complicated than I thought, and Lucene doesn't quite do
everything. Certainly, searching a document range is something that
needs to be bolted onto Lucene. In professional Bible software it gets
very complicated; in Accordance you can do some insane searches. Martin
might be onto something trying to write his own; it would be fine for it
to take 10x as long as Lucene if it supported everything we want. But on
the other hand, this is the guy writing Lucene
(http://lucene.sourceforge.net/publications.html), so it might make
sense to alter Lucene to our requirements.
GCJ Lucene vs CLucene vs Lucene:
I tried to compile Lucene with gcj (the svn distribution of Lucene
comes with a makefile that worked straight off). It weighs in at 1.3MB,
but you will probably still need some part of the 5MB libgcj library. I
didn't get any further with this solution. It might be a possibility,
but I haven't yet built anything with it.
Troy: you asked for my code to access index order. I can give you Java
code, but CLucene doesn't support it yet. There seem to be many areas
where CLucene is lagging far behind Lucene. Sorting is one example, and
it is essential to do the sorting inside Lucene for fast searching.
A file storing the module version and the index method version is
essential, so that an out-of-date index can be detected and rebuilt. I
have changed my index structure several times, and probably will again
in the future (e.g. for morphology searching). I don't store the indexes
with the modules, in case the modules are loaded from a CD or locked.
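
As a rough illustration of such a version file, here is a minimal sketch. The file name index_stamp.json, the JSON layout, and the INDEX_FORMAT_VERSION constant are all assumptions made for the example, not what MacSword actually does.

```python
import json
from pathlib import Path

# Hypothetical constant: bump this whenever the index layout changes.
INDEX_FORMAT_VERSION = 3


def write_index_stamp(index_dir, module_version):
    """Record which module version and index format built this index."""
    stamp = {"module_version": module_version,
             "index_format": INDEX_FORMAT_VERSION}
    (Path(index_dir) / "index_stamp.json").write_text(json.dumps(stamp))


def index_is_stale(index_dir, module_version):
    """Return True if the index is missing or was built from other versions."""
    stamp_path = Path(index_dir) / "index_stamp.json"
    if not stamp_path.exists():
        return True
    try:
        stamp = json.loads(stamp_path.read_text())
    except ValueError:  # unreadable stamp: rebuild to be safe
        return True
    if not isinstance(stamp, dict):
        return True
    return (stamp.get("module_version") != module_version
            or stamp.get("index_format") != INDEX_FORMAT_VERSION)
```

Because the stamp lives in the index directory rather than next to the module, it works even when the module itself sits on read-only media.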
Top twenty words in KJV:
unto, shall, lord, he, his, all, thou, them, which, i, him, said, thy,
from, god, thee, ye, shalt, children, israel
Lucene index types and indexing speed:
KJV index with the Java version of Lucene = 8'38 (3MB)
using the simple analyser = 8'02 (3.1MB)
using setMergeFactor(1000); setMaxBufferedDocs(1000) (previously called
minMergeDocs) = 5'47, but uses about 90MB of memory. Change these two
parameters to control excessive file handles.
Size of index: 2.6MB, or 6.8MB when storing the verses.
Note that the KJV = 5.4MB. Thus the KJV plus a plain index (about 8MB)
is larger than the index with stored verses. The index with stored
verses is also faster to access, but probably takes up a load of
memory. ;P
The standard analyser looks for things like emails and other stuff --
and last I checked Jesus didn't have an email address. The stop
analyser might be better if we want to cull words like "and" and "the",
but why stop the user? There are 23867 verses containing "and" in the
KJV. :) The standard analyser also culls apostrophes, (I don't think we
Lookup is fast, but rendering all the verses takes far longer. Note
that this isn't so important now, because I only load the verses when
they are displayed and then cache them, which reduces the display time
to nothing.
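
The load-on-display-then-cache idea can be sketched like this. The render_verse function is a hypothetical stand-in for whatever the slow rendering step is, and the cache size is an arbitrary choice for the example.

```python
from functools import lru_cache

render_calls = []  # records which verses were actually rendered


def render_verse(verse_index):
    """Stand-in for the expensive rendering step (hypothetical placeholder)."""
    render_calls.append(verse_index)
    return "<p>verse %d</p>" % verse_index


@lru_cache(maxsize=4096)
def cached_verse(verse_index):
    """Render a verse on first request and reuse the result afterwards."""
    return render_verse(verse_index)


def visible_rows(hits, first_row, row_count):
    """Render only the hits that are currently on screen."""
    return [cached_verse(v) for v in hits[first_row:first_row + row_count]]
```

With 3892 hits for "god", only the handful of rows on screen are ever rendered, and scrolling back over them costs nothing.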
Search for "jesus": 943 results
  Search: 67ms (negligible)
  Display (stored in Lucene): 3s
Search for "god": 3892 results
  Display (stored): 11s
Search for "god*": 4094 results
  Display (stored): 11s
Ordering of searches:
The results really need to be ordered by Bible verse. Lucene's ranking
means that the shortest verses always come first, e.g. "Jesus wept." is
always the top verse for "jesus". IMO this doesn't make much sense to
My current solution is to sort by index order. Another solution is to
store keys as indexes: you can store these as strings, and Lucene can
then do the sorting for you. (NB you seem to need to store them as fixed width
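
One likely reason fixed-width keys matter, shown as a small sketch: strings compare character by character, so unpadded numbers sort in the wrong order. The verse_key helper and the width of 5 digits are assumptions for illustration only.

```python
def verse_key(verse_index, width=5):
    """Zero-pad a verse's position so string order equals numeric order."""
    return str(verse_index).zfill(width)


hits = [23145, 7, 1001, 42]

# Plain strings sort lexicographically: "1001" comes before "42".
unpadded = sorted(str(h) for h in hits)

# Fixed-width strings sort the same way the numbers do.
padded = sorted(verse_key(h) for h in hits)
```

Here unpadded comes out as 1001, 23145, 42, 7, while padded comes out as 00007, 00042, 01001, 23145.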
Restricting of searches:
Again, this is another area that is essential to do inside Lucene for
speed. I haven't figured this one out yet, but I'm thinking I will write
a custom Lucene filter, which would be much faster if I stored the verse
as an index and then produced a set of numerical ranges. For searching
in the previous results, you should (I've been told) simply AND the
searches together. I don't support these yet, and it is probably quite
some work. It would probably only take 10s of searching time to
retrofit it on top of Lucene, but that is 10s on top of nothing.
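
To make the two ideas above concrete, here is a sketch in plain Python rather than a real Lucene filter. It assumes each hit is a numeric verse index and that the ranges are inclusive and do not overlap; all function names are made up for the example.

```python
from bisect import bisect_right


def make_range_filter(ranges):
    """Build a predicate from inclusive (first, last) verse-index ranges.

    Assumes the ranges do not overlap.
    """
    ordered = sorted(ranges)
    starts = [first for first, _ in ordered]

    def allowed(verse_index):
        # Find the last range that starts at or before this verse.
        pos = bisect_right(starts, verse_index) - 1
        return pos >= 0 and verse_index <= ordered[pos][1]

    return allowed


def restrict(hits, ranges):
    """Keep only the hits that fall inside one of the ranges."""
    allowed = make_range_filter(ranges)
    return [h for h in hits if allowed(h)]


def search_within(previous_hits, new_hits):
    """Search inside earlier results by ANDing the two hit lists."""
    keep = set(previous_hits)
    return [h for h in new_hits if h in keep]
```

Doing this after the search, as here, is the retrofit approach; the point of a filter inside Lucene would be to skip the excluded verses before they are ever collected.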
Fuzzy searches are neat: "abraham~" finds abram and abraham; "hezikia~"
finds hezekiah. Really useful for bad spellers and all those
impossible-to-spell Bible names.
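
Lucene's fuzzy matching is based on edit distance. The sketch below is a plain Levenshtein distance to show why those two examples match; the cut-off of 2 edits is an assumption for the example and is not how Lucene scores similarity.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_matches(term, vocabulary, max_edits=2):
    """Return the words within max_edits of term (hypothetical cut-off)."""
    return [w for w in vocabulary if edit_distance(term, w) <= max_edits]
```

Both "abraham" to "abram" and "hezikia" to "hezekiah" are two edits apart, which is why a small tolerance is enough to catch them.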
To highlight searches, you can get Lucene to give you the list of words
for a search. You can then highlight all of these words in the verse.
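
Given such a list of words, the highlighting step itself might look like the sketch below. The bold tags and the whole-word, case-insensitive matching are assumptions; how the word list is obtained from Lucene is not shown.

```python
import re


def highlight(verse, words, open_tag="<b>", close_tag="</b>"):
    """Wrap each whole-word, case-insensitive match of words in tags."""
    if not words:
        return verse
    # Longest words first, so a longer alternative is tried before a shorter one.
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: open_tag + m.group(1) + close_tag, verse)
```

For example, highlight("Jesus wept.", ["jesus"]) gives "<b>Jesus</b> wept.", keeping the capitalisation of the verse text.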
IMO people rarely want to do OR searches, so I changed the default
to AND in the Lucene version used by MacSword. This means >>jesus
wept<< is ANDed, >>jesus OR wept<< is ORed, and >>"jesus wept"<< is the
phrase. Other than that, the Lucene syntax makes sense.
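
The three query forms above can be illustrated with a toy matcher. This is not Lucene's query parser: it only handles bare words, OR, and double-quoted phrases, and it assumes AND binds tighter than OR, which is a simplification made for the example.

```python
import re


def matches(query, verse):
    """Evaluate a tiny query against one verse.

    Bare terms are ANDed, OR separates alternatives, and double-quoted
    text must appear as a phrase.
    """
    words = re.findall(r"[a-z']+", verse.lower())
    text = " " + " ".join(words) + " "

    def term_matches(token):
        term_words = re.findall(r"[a-z']+", token.lower())
        if not term_words:
            return False
        # A multi-word (quoted) term must appear as consecutive words.
        return (" " + " ".join(term_words) + " ") in text

    groups = [[]]
    for token in re.findall(r'"[^"]*"|\S+', query):
        if token == "OR":
            groups.append([])
        else:
            groups[-1].append(token)
    return any(bool(g) and all(term_matches(t) for t in g) for g in groups)
```

So matches('jesus wept', ...) needs both words somewhere in the verse, matches('jesus OR wept', ...) needs either, and matches('"jesus wept"', ...) needs them side by side.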