[sword-devel] Fast search index options

Nathan sword-devel@crosswire.org
Sat, 2 Sep 2000 10:51:48 +0200


Hi,

On 29th August 2000, Joe wrote:
>DistinctPassage - a simple list of verses.
>RangedPassage - stores a list of verse pairs (start and end)
>BitwisePassage - stores a 31k long bitmap. One bit per verse

I did some quick tests this morning on the 3 options.
The bitmaps have fixed size (3888 bytes)
The first 2 options depend on how many times the word occurs 
in the Bible.

*Note*
The test was done on the list of words as I extracted them.
My list has 12560 unique words for the KJV.

Results:
In 52 cases the bitmap (option 3) was the smallest
In 1 case the RangeList (option 2) was the smallest 
The rest of the time the list of pointers to verses 
   (option 1) are the smallest.

Option 2: The only case where this helped was with the 
word 'gibeonites' which occurs 5 times in the Bible,
and has 2 ranges. Option 1 is 10 bytes and option 2
is 8 bytes. We thus save 2 bytes. Not really worth it
for all the extra coding effort that needs to be put in.
I would thus discard this option.

Using just option 1 and 3 (and choosing the best option for
each word), I get the following:
52 words * 3888 bytes = 202176 bytes
12508 words: Total length of option 1 = 679446 bytes
Need an indicator for each word to tell us which option was 
used = 12560 bytes
TOTAL: 894182 bytes (about 0.9 Mb)

I think this size is quite acceptable for an index (personal opinion :)

God bless,
nathan