[sword-devel] Fast search index options

Jerry Hastings sword-devel@crosswire.org
Sat, 02 Sep 2000 22:55:41 -0700

At 10:51 AM 9/2/2000 +0200, Nathan wrote:
>On 29th August 2000, Joe wrote:
> >DistinctPassage - a simple list of verses.
> >RangedPassage - stores a list of verse pairs (start and end)
> >BitwisePassage - stores a 31k long bitmap. One bit per verse
>I did some quick tests this morning on the 3 options.
>The bitmaps have fixed size (3888 bytes)
>The first 2 options depend on how many times the word occurs
>in the Bible.

Here are three more methods to try.  Consider the bit map divided into 
records, of say 32, 16, or 8 bits each.

1) Record a list of non-zero bit map records for a word. The list would 
have 6 byte records of two fields. One to store a 32 bit record from the 
bit map and  the other, an integer to store the bit map record 
number.  This may be good for words like "Micaiah" which is in 18 verses, 
but only two chapters. Depending on where the record boundaries fall this 
may reduce to a two record list.

2)  Similar to 1, but instead of the second field being a record number, 
the second record indicates a number of blank  records following the 
non-zero record. Sort of a RLE compression. This second field could be 8 
or16 bits for all records in the list depending on the largest break 
between verses with the word.

3) Split the bit map into bytes. If the first byte is non-zero, count the 
number of bytes to the first zero byte, but not more than 128 bytes. Add 
128 to the count and save as the first byte of a file, followed by the 
counted non-zero bytes. If the first byte was zero, count the number of 
bytes until a non-zero byte, but not more than 128 bytes. Save the count as 
the first byte of a file, but do not save the zero bytes. Start again at 
the first byte after the bytes counted above and keep repeating. For some 
words, like "the", NOT the bit map first--flip 1s and 0s.

These methods should allow for quick rebuilding of the bit maps and for 
reproducing just a section of a bit map for ORing with another bit map.