[sword-devel] Searching and Lucene thoughts

Tue Mar 1 15:42:04 MST 2005

I apologise for my ramblings, but here are some searching thoughts that 
I've collected as I implemented lucene searching in MacSword:

Searching
It is more complicated than I thought, and lucene doesn't quite do 
everything. Searching within a document range, for one, is something 
that needs to be bolted onto lucene. In pro bible software it gets very 
complicated; in Accordance you can do some insane searches. Martin 
might be onto something in trying to write his own: it would be fine 
for it to take 10x as long as lucene if it supported everything we 
want. On the other hand, look at who is writing lucene 
(http://lucene.sourceforge.net/publications.html); it might make more 
sense to alter lucene to our requirements.

GCJ Lucene vs CLucene vs Lucene:
I tried compiling lucene with gcj (the svn distribution of lucene 
comes with a makefile that worked straight off). It weighs in at 1.3MB, 
but you will probably still need some part of the 5MB libgcj library. 
This might be a possibility, but I didn't get any further and haven't 
yet built anything with it.

Troy: you asked for my code to access index order. I can give you java 
code, but clucene doesn't support it yet. There seem to be many areas 
where clucene is lagging far behind lucene. For example, it lacks 
sorting, which needs to be done inside lucene for searching to be fast.

Indexes:
A file storing the module version and the index method version is 
essential. I have changed my index structure several times, and 
probably will again in the future (e.g. for morphology searching). I 
don't store the indexes with the modules, in case the modules are 
loaded from a CD or are locked.
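
As a rough sketch of the version file idea (not MacSword's actual code), 
the index directory can carry a small stamp that is checked before the 
index is used. The file name, the JSON layout and the 
INDEX_FORMAT_VERSION constant are all assumptions made for 
illustration.

```python
import json
from pathlib import Path

# Hypothetical constant: bump it whenever the index layout changes
# (for example, when morphology fields are added).
INDEX_FORMAT_VERSION = 3
STAMP_NAME = "index-version.json"  # hypothetical file name


def write_stamp(index_dir: Path, module_version: str) -> None:
    """Record which module version and index layout built this index."""
    stamp = {"module_version": module_version,
             "index_format": INDEX_FORMAT_VERSION}
    (index_dir / STAMP_NAME).write_text(json.dumps(stamp))


def needs_rebuild(index_dir: Path, module_version: str) -> bool:
    """True if the stamp is missing, unreadable, or out of date."""
    try:
        stamp = json.loads((index_dir / STAMP_NAME).read_text())
    except (OSError, ValueError):
        return True
    if not isinstance(stamp, dict):
        return True
    return (stamp.get("module_version") != module_version
            or stamp.get("index_format") != INDEX_FORMAT_VERSION)
```

Because the stamp lives with the index rather than with the module, it 
still works when the module itself sits on read-only media.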

Top twenty words in KJV:
unto, shall, lord, he, his, all, thou, them, which, i, him, said, thy, 
from, god, thee, ye, shalt, children, israel

Lucene index types and indexing speed:
KJV index with java version of lucene = 8'38 (3MB)
using the simple analyser = 8'02 (3.1MB)
using setMergeFactor(1000); setMaxBufferedDocs(1000) (previously called 
minmergedocs) = 5'47
The last setting uses about 90MB of memory. Change these two 
parameters to control excessive file handles.

Size of index: 2.6MB, or 6.8MB when storing the verses.
Note that the KJV = 5.4MB. Thus the KJV plus a plain index (5.4 + 2.6 = 
8.0MB) is larger than the index with stored verses (6.8MB). The index 
with stored verses is also faster to access, but probably takes up a 
load of memory. ;P

Analysers:
The standard analyser looks for things like emails and other stuff -- 
and last I checked Jesus didn't have an email address. The stop 
analyser might be better if we want to cull words like "and" and "the", 
but why stop the user? There are 23867 verses containing "and" in the 
KJV. :) The standard analyser also culls apostrophes, which I don't 
think we want.
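
To make the requirement concrete, here is a minimal sketch of the 
tokenising behaviour described above: lower-case everything, keep 
internal apostrophes, drop no stop words. It is plain Python rather 
than a lucene Analyzer, and the ASCII-only pattern is an assumption 
that suits English text like the KJV but not other modules.

```python
import re

# Runs of letters, optionally joined by internal apostrophes ("lord's").
_TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def analyse(text):
    """Split text into lower-case tokens; no stop words are removed."""
    return _TOKEN.findall(text.lower())
```

For example, analyse("And the LORD's anger") keeps both "and" and 
"lord's" as searchable tokens.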

Speed:
Lookup is fast, but rendering all the verses takes far longer. This 
isn't so important now, because I only load the verses when they are 
displayed and then I cache them, which reduces the display time to 
nothing.
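
The load-on-display-then-cache approach can be sketched as follows. 
This is only an illustration: render_verse is a hypothetical stand-in 
for the real (slow) rendering step, and the stats counter exists just 
to show how many renders actually happen.

```python
from functools import lru_cache

stats = {"renders": 0}  # illustration only: counts the slow calls


def render_verse(key):
    """Hypothetical stand-in for the expensive rendering step."""
    stats["renders"] += 1
    return f"<p>{key}</p>"


@lru_cache(maxsize=None)
def rendered(key):
    """Render a verse on first request, then serve it from the cache."""
    return render_verse(key)


def visible_rows(hits, first, count):
    """Render only the slice of the hit list that is on screen."""
    return [rendered(key) for key in hits[first:first + count]]
```

With a thousand hits and twenty rows on screen, only twenty verses are 
rendered up front, and scrolling back over them costs nothing.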

Search for "jesus" 943 results
Search: 67ms (negligible)
Display: 21s
Display (stored in lucene): 3s

Search for "god" 3892 results
Search: 13ms
Display:  1'10s
Display (stored):  11s

Search for "god*" 4094 results
Search: 40ms
Display: 1'11s
Display (stored): 11s

Ordering of searches:
The results really need to be ordered by bible verse; lucene's ranking 
means that the shortest verses always come first, e.g. "Jesus wept." is 
always the top verse for "jesus". IMO this doesn't make much sense to 
the user.
	My current solution is to sort by index order. Another solution is to 
store the keys as indexes: if you store these as strings, lucene can 
then do the sorting for you. (NB you seem to need to store them as 
fixed-width strings, presumably because strings are compared character 
by character, so "10" would sort before "9".)
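
A small sketch of why the fixed width matters: once the verse index is 
zero-padded, plain string order matches numeric order. The width of 6 
is an assumption; any width large enough for the biggest index works.

```python
KEY_WIDTH = 6  # assumed width; must fit the largest verse index


def sort_key(verse_index):
    """Encode a verse index as a zero-padded, fixed-width string."""
    key = f"{verse_index:0{KEY_WIDTH}d}"
    if verse_index < 0 or len(key) > KEY_WIDTH:
        raise ValueError(f"verse index {verse_index} does not fit the key width")
    return key
```

Unpadded, sorted(["9", "10", "100"]) puts "10" and "100" ahead of "9"; 
with padded keys the order is 9, 10, 100 as intended.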

Restricting of searches:
Again, this is another area that needs to be done inside lucene for 
speed. I haven't figured this one out yet, but I'm thinking I will 
write a custom lucene filter. That would be much faster if I stored 
the verse as an index and then produced a set of numerical ranges. For 
searching in the previous results, you should (I've been told) simply 
AND the searches together. I don't support these yet, and doing them 
properly is probably quite some work. Retrofitting them on top of 
lucene would probably only add 10s of searching time, but that is 10s 
on top of nothing.
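
The numerical-range idea can be sketched without lucene at all. The 
code below is plain Python, not a lucene Filter, and assumes each verse 
is identified by an integer index; it merges the allowed ranges once 
and then tests each hit with a binary search. The second function shows 
"search within previous results" as a simple intersection, which is 
what ANDing the two searches amounts to.

```python
from bisect import bisect_right


def make_range_filter(ranges):
    """Build a test for inclusive (first, last) verse index ranges."""
    merged = []
    for first, last in sorted(ranges):
        if merged and first <= merged[-1][1] + 1:
            # Overlapping or adjacent: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    starts = [first for first, _ in merged]

    def allowed(verse_index):
        i = bisect_right(starts, verse_index) - 1
        return i >= 0 and verse_index <= merged[i][1]

    return allowed


def search_within(previous_hits, new_hits):
    """Keep only the new hits that were also in the previous results."""
    return sorted(set(previous_hits) & set(new_hits))
```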

Other stuff:
Fuzzy searches are neat: "abraham~" finds abram and abraham; "hezikia~" 
finds hezekiah. Really useful for bad spellers and all those 
ridiculously impossible-to-spell bible names.
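
Fuzzy matching of this kind is based on edit distance. The sketch 
below is a plain Levenshtein implementation with a fixed cut-off of 2, 
which is an assumption for illustration; lucene's own threshold and 
scoring are not reproduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance: inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def fuzzy_matches(term, vocabulary, max_distance=2):
    """Return the vocabulary words within max_distance edits of term."""
    return [w for w in vocabulary if edit_distance(term, w) <= max_distance]
```

Both examples above are two edits apart: "abraham" to "abram" drops two 
letters, and "hezikia" to "hezekiah" changes one letter and adds one.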
	To highlight searches, you can get lucene to give you a list of words 
for a search. You can then highlight all of these words in the verse.
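
Given such a list of words, the highlighting step itself is short. A 
minimal sketch, assuming the terms arrive as plain lower-case strings 
and that bold tags are the wanted markup (both are assumptions):

```python
import re


def highlight(verse, terms, open_tag="<b>", close_tag="</b>"):
    """Wrap every whole-word occurrence of the terms in the given tags."""
    if not terms:
        return verse
    # Longest first, so a longer term wins over a shorter one it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE)
    return pattern.sub(lambda m: open_tag + m.group(0) + close_tag, verse)
```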

IMO people rarely want to do OR searches, so I changed the default 
to AND in the lucene version used by MacSword. This means >>jesus 
wept<< is ANDed, >>jesus OR wept<< is ORed, and >>"jesus wept"<< is the 
phrase. Other than that the lucene syntax makes sense.
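
The three cases can be illustrated with a toy matcher. This is not 
lucene's query parser, only a sketch of the default-AND behaviour 
described above, and it makes one simplifying assumption: OR splits the 
query into groups, and the words inside each group are ANDed.

```python
import re

_CLAUSE = re.compile(r'"([^"]+)"|(\S+)')


def parse(query):
    """Return a list of OR-groups; each group is a list of ANDed clauses."""
    groups = [[]]
    for phrase, word in _CLAUSE.findall(query):
        if word == "OR":
            groups.append([])
        else:
            groups[-1].append((phrase or word).lower())
    return [g for g in groups if g]


def matches(query, verse):
    """True if the verse satisfies the query under default-AND rules."""
    words = re.findall(r"[a-z']+", verse.lower())

    def clause_ok(clause):
        parts = clause.split()  # a phrase is several consecutive words
        n = len(parts)
        return any(words[i:i + n] == parts
                   for i in range(len(words) - n + 1))

    return any(all(clause_ok(c) for c in group) for group in parse(query))
```

On a few toy strings, >>jesus wept<< matches only those containing both 
words, >>jesus OR wept<< matches those containing either, and 
>>"jesus wept"<< matches only those where the words are adjacent.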

cheers --Will