[sword-devel] Fast search index options

Joe Walker sword-devel@crosswire.org
Sun, 3 Sep 2000 11:52:57 +0100


> On 29th August 2000, Joe wrote:
> >DistinctPassage - a simple list of verses.
> >RangedPassage - stores a list of verse pairs (start and end)
> >BitwisePassage - stores a 31k long bitmap. One bit per verse
...
> Option 2: The only case where this helped was with the 
> word 'gibeonites' which occurs 5 times in the Bible,
> and has 2 ranges. Option 1 is 10 bytes and option 2
> is 8 bytes. We thus save 2 bytes. Not really worth it
> for all the extra coding effort that needs to be put in.
> I would thus discard this option.

Probably sensible.

Note there are 2 different things - on disk storage
and in memory storage which have different requs.

The reason it is there is probably
mostly historical. The code you have enables lots of maths
on Passages. (Add this set to that set, etc) and
ranges seemed like a good idea early on. My code uses
bitmaps internally almost everywhere. In my tests for
in-memory speed they are usually quickest. There are
some specific cases when RangedPassage is very quick
but in general it is very slow.

> Using just option 1 and 3 (and choosing the best option for
> each word), I get the following:
> 52 words * 3888 bytes = 202176 bytes
> 12508 words: Total length of option 1 = 679446 bytes
> Need an indicator for each word to tell us which option was 
> used = 12560 bytes
> TOTAL: 894182 bytes (about 0.9 Mb)
> 
> I think this size is quite acceptable for an index (personal opinion :)

Agreed.
I'll perhaps have a look at why mine is so big. I expect the
reason is that I have used Java serialized objects (for
speed reasons)

Joe.