[sword-devel] Some other fast search index options
Sun, 17 Sep 2000 15:05:59 +0200
Good day Jerry Hastings and others
On 3rd September, Jerry Hastings wrote:
> Here are three more methods to try. Consider the bit map
> divided into records, of say 32, 16, or 8 bits each.
I did the tests as you recommended. The size of the index file
comes down from 894 Kb to about 814 Kb.
I also tried using Huffman compression, LZW, LZSS, plain RLE,
or a combination of all the above. I got it to about 600K.
The problem is that we are still compressing the BITMAP files.
The uncompressed bitmap files are 48.8 Mbyte. Compressing this
down to 600K is impressive, but we can do better by rather
trying to compress the verse-lists (list of verses in which the
word occurs). The uncompressed size of these lists is 1.23 Mb.
Reading up on the subject in the latest research papers,
they all seem to indicate that the best compression of the
index file is to use these verse-lists (option 1, also
called 'inverted lists'), and compress these lists.
The steps are:
1. Get the list of verses where the word occurs.
This can be an array of 16-bit words (for the Bible)
2. Get the differences between the verses (deltas), e.g.
if the verses are 3, 10, 15, 21, ...
the delta list would be 3, 7, 5, 6, ...
These deltas can be fairly much the same throughout the list.
3. Compress this delta-list with something like
arithmetic coding (seems to work best).
To use this index, the following will have to be done:
1. Evaluate the search query
2. Get the words in the word-list.
3. Get their corresponding compressed delta-lists.
5. Convert delta-list to verse-list
6. Convert to bitmap if you need to do NOT, OR, AND, or
Question: Is it really worth all that effort to save 400K
on the size of the index? A few years ago I would have done
it, but with the smallest hard disks for PC's being 8 Gb
nowadays, is it really a big deal to have an index of 900K?
If you really want to compress, then why keep the text of the
Bible uncompressed? You can save much more compressing that! :)