[sword-devel] Searching and Lucene thoughts

Tue Mar 1 17:45:43 MST 2005

Will Thimbleby wrote:

> I apologise for my ramblings, but here are some searching thoughts 
> that I've collected as I implemented lucene searching in MacSword:
>
> Searching
> It is more complicated than I thought, and lucene doesn't quite do 
> everything. Certainly to do a document range is something that needs 
> to be bolted onto lucene. In pro bible software it gets very 
> complicated; in Accordance you can do some insane searches. Martin 
> might be onto something trying to write his own; it would be fine to 
> take 10x as long as lucene, and support everything we want. But on the 
> other hand, this is the guy writing lucene 
> (http://lucene.sourceforge.net/publications.html), so it might make 
> sense to alter lucene to our requirements.
>
Can we enumerate what Lucene does not support that we want for Biblical 
searching?

The only thing I saw was that it did not find adjacent documents. For 
example, find all verses containing Moses within 5 verses of Aaron.

As long as we build the index from first verse to last verse, the number 
that lucene returns for a hit is the number it assigned to the verse when 
the verse was added. We cannot reliably use this number to work out which 
verse was returned: 3 may or may not mean Genesis 1:3 (in an NT-only 
module it would mean Matthew 1:3). For this reason we store the OSIS 
reference in the index along with the verse. However, we can be certain 
(because lucene guarantees it) that index 25 and index 26 are two verses 
that were added one after the other.

To do proximity searching, we probably have to parse the search request 
for a special w/in conjunction, run each part as a separate query, and, 
via post-processing, put the results together.
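
A minimal sketch of that post-processing step in Java, assuming each 
sub-query's hits have already been collected into a java.util.BitSet keyed 
by the number lucene assigned when the verse was added. The class and 
method names are made up for illustration; nothing here is lucene API.

```java
import java.util.BitSet;

/** Hypothetical helper: combines two hit sets by verse proximity. */
final class ProximitySketch {
    /**
     * Returns the positions set in left that have at least one position
     * set in right no more than distance verses away, in either direction.
     */
    static BitSet within(BitSet left, BitSet right, int distance) {
        BitSet result = new BitSet();
        for (int i = left.nextSetBit(0); i >= 0; i = left.nextSetBit(i + 1)) {
            // First hit in right at or after the start of the window.
            int from = Math.max(0, i - distance);
            int next = right.nextSetBit(from);
            // Keep i only if that hit also falls before the window ends.
            if (next >= 0 && next <= i + distance) {
                result.set(i);
            }
        }
        return result;
    }
}
```

For "Moses w/in 5 Aaron", left would hold the hits for Moses, right the 
hits for Aaron, and distance would be 5. This only works because adjacent 
numbers are adjacent verses; it does not stop at book boundaries.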

Has anyone thought of another way?

> Troy: you asked for my code to access index order, I can give you java 
> code, but clucene doesn't support it yet. There seem to be many areas 
> where clucene is lagging far behind lucene, for example sorting, which 
> is essential to do in lucene for fast searching.
>
I would be interested in the Java code, if you don't mind.

>
> Indexes:
> A file storing the module version and the index method version is 
> essential. I have changed my index structure several times, and 
> probably will do in the future (eg. for morphology searching). I don't 
> store the indexes with the modules in case the modules are loaded from 
> a CD or locked.

Can you send me your code that builds the index as well?

I agree that it probably would be best to store the index separate from 
the module.

>
>
<snip/>

> Analysers:
> The standard analyser looks for things like emails and other stuff -- 
> and last I checked Jesus didn't have an email address. The stop 
> analyser might be better if we want to cull words like "and" and 
> "the", but why stop the user. There are 23867 verses containing "and" 
> in the KJV. :) The standard analyser also culls apostrophes, (I don't 
> think we want to)
>
For JSword, we have been using the Standard Analyzer, but after your 
comments I took a look at the lucene code and I think that you are 
right. It increases the size of the index by a meg, but that is not that 
big a deal. I think that it will reduce the CPU usage as well. Time to 
do some more experimenting....

>
>
<snip/>

>
>
> Ordering of searches:
> The results really need to be ordered by bible verse, lucene's ranking 
> means that the shortest verses always come first, eg. "Jesus wept." is 
> always the top verse for "jesus". IMO this doesn't make much sense to 
> the user.
>     My current solution is to sort by index order. Another solution is 
> to store keys as indexes: You can store these as a string, lucene can 
> then do the sorting for you. (NB you seem to need store them as fixed 
> width strings).
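
On the fixed-width point: strings compare character by character, so an 
unpadded "10" sorts before "9". A small Java illustration, assuming the key 
is simply the verse's position in the module, zero-padded to five digits. 
The helper names are hypothetical and this is not lucene code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: builds sortable fixed-width keys for verses. */
final class FixedWidthKeySketch {
    /** Zero-pads the position so that string order equals numeric order. */
    static String key(int position) {
        return String.format(Locale.ROOT, "%05d", position);
    }

    /** Builds keys for the given positions and sorts them as plain strings. */
    static List<String> sortedKeys(int... positions) {
        List<String> keys = new ArrayList<>();
        for (int p : positions) {
            keys.add(key(p));
        }
        Collections.sort(keys);
        return keys;
    }
}
```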

Just a suggestion (which we use in JSword): use a BitSet to store the 
hits. It takes 31102 bits, one per verse, to represent the entire Bible, 
which comes to about 3.8K. The BitSet is implicitly ordered, and Java 
allows pretty efficient iteration over the set.
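
Roughly, and only as a sketch: the class below is hypothetical (it is not 
the JSword code) and assumes each hit is identified by the verse's 
position in the module.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical helper: keeps search hits in a BitSet, one bit per verse. */
final class HitSetSketch {
    static final int VERSES_IN_BIBLE = 31102;

    /** Records hits in whatever order the search engine returned them. */
    static BitSet collect(int[] hitPositions) {
        BitSet hits = new BitSet(VERSES_IN_BIBLE);
        for (int position : hitPositions) {
            hits.set(position);
        }
        return hits;
    }

    /** Walks the set bits from lowest to highest, i.e. in verse order. */
    static List<Integer> inOrder(BitSet hits) {
        List<Integer> ordered = new ArrayList<>();
        for (int i = hits.nextSetBit(0); i >= 0; i = hits.nextSetBit(i + 1)) {
            ordered.add(i);
        }
        return ordered;
    }
}
```

Hits can be added in ranking order and still come back out in verse 
order, with no explicit sort.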

I think there is room for two different kinds of searches:
1) Find verses that match the criteria that I provide. (Standard boolean 
searches)
2) Fuzzy searches, natural language searches, "more like this" searches: 
help me find a verse which is something like this search.

In the first case the answer set probably is best ordered by bible verse.
The second is probably better ordered by ranking.

>
>
> Restricting of searches:
> Again another area that is essential for speed to do in lucene. I 
> haven't figured this one out yet, but I'm thinking I will write a 
> custom lucene filter. Which would be much faster if I stored the verse 
> as an index, and then produced a set of numerical ranges. For 
> searching in the previous results, you should (I've been told) simply 
> AND the searches together. I don't support these yet, and it is 
> probably quite some work, -- it would probably only take 10s of 
> searching time to retrofit it on top of lucene, but that is 10s on top 
> of nothing.

Lucene searches fast enough that restricting the search inside lucene is 
not necessary. Using the BitSet does not add appreciable time. It is 
easy enough to create a mask and AND that with the search results to get 
the restricted answer set.
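
Something along these lines, as a sketch only. It assumes the restriction 
is a single contiguous range of verse positions (one book, say); the class 
and parameter names are invented for the example.

```java
import java.util.BitSet;

/** Hypothetical helper: restricts a hit set to a range of verses. */
final class RestrictSketch {
    /**
     * Returns the hits that fall in [fromInclusive, toExclusive).
     * The original hit set is left untouched.
     */
    static BitSet restrict(BitSet hits, int fromInclusive, int toExclusive) {
        // Mask with a 1 for every verse inside the allowed range.
        BitSet mask = new BitSet();
        mask.set(fromInclusive, toExclusive);

        // AND a copy of the hits with the mask.
        BitSet restricted = (BitSet) hits.clone();
        restricted.and(mask);
        return restricted;
    }
}
```

A restriction made of several ranges would just set each range in the 
same mask before the AND. Searching within previous results is the same 
operation, with the earlier result set used as the mask.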

>
>
> Other stuff:
> Fuzzy searches are neat "abraham~" finds abram and abraham; "hezikia~" 
> finds hezekiah. Really useful for bad spellers and all those 
> ridiculously impossible to spell bible names.
>     To highlight searches, you can get lucene to give you a list of 
> words for a search. You can then highlight all of these words in the 
> verse.

I saw your other post on fuzzy match and would like to know how you got 
the words that were hit out of lucene.

>
> IMO rarely do people want to do OR searches, so I changed the default 
> to AND in the lucene version used by MacSword. This means >>jesus 
> wept<< is ANDed >>jesus OR wept<< is ORed, and >>"jesus wept"<< is the 
> phrase. Other than that the lucene syntax makes sense.

On a project that I did, we found that people wanted to do phrase 
searching even more than AND, and AND more than OR, unless they were 
doing "natural language" querying.
It might be nice to set it as a preference.