[sword-devel] Some other fast search index options

Drew Haninger sword-devel@crosswire.org
Mon, 18 Sep 2000 07:17:57 -0700


With Palm Pilot's with more memory available, it would seem not worth the
effort, but with cell phones on the horizon as a new platform, the effort
may be worth it.  We never finished doing a better compression for the
smaller Palm Pilot's.

Drew Haninger

----- Original Message -----
From: "Nathan" <mail@nathan.co.za>
To: <sword-devel@crosswire.org>
Sent: Sunday, September 17, 2000 6:05 AM
Subject: RE: [sword-devel] Some other fast search index options


> Good day Jerry Hastings and others
>
> On 3rd September, Jerry Hastings wrote:
> > Here are three more methods to try.  Consider the bit map
> > divided into records, of say 32, 16, or 8 bits each.
>
> I did the tests as you recommended. The size of the index file
> comes down from 894 Kb to about 814 Kb.
> I also tried using Huffman compression, LZW, LZSS, plain RLE,
> or a combination of all the above. I got it to about 600K.
>
> The problem is that we are still compressing the BITMAP files.
> The uncompressed bitmap files are 48.8 Mbyte. Compressing this
> down to 600K is impressive, but we can do better by rather
> trying to compress the verse-lists (list of verses in which the
> word occurs). The uncompressed size of these lists is 1.23 Mb.
> (Option 1)
>
> Reading up on the subject in the latest research papers,
> they all seem to indicate that the best compression  of the
> index file is to use these verse-lists (option 1, also
> called 'inverted lists'), and compress these lists.
>
> The steps are:
> 1. Get the list of verses where the word occurs.
>    This can be an array of 16-bit words (for the Bible)
> 2. Get the differences between the verses (deltas), e.g.
>    if the verses are 3, 10, 15, 21, ...
>    the delta list would be 3, 7, 5, 6, ...
>    These deltas can be fairly much the same throughout the list.
> 3. Compress this delta-list with something like
>    arithmetic coding (seems to work best).
>
> To use this index, the following will have to be done:
> 1. Evaluate the search query
> 2. Get the words in the word-list.
> 3. Get their corresponding compressed delta-lists.
> 4. Uncompress.
> 5. Convert delta-list to verse-list
> 6. Convert to bitmap if you need to do NOT, OR, AND, or
>    wildcards
>
> Question: Is it really worth all that effort to save 400K
> on the size of the index? A few years ago I would have done
> it, but with the smallest hard disks for PC's being 8 Gb
> nowadays, is it really a big deal to have an index of 900K?
>
> If you really want to compress, then why keep the text of the
> Bible uncompressed? You can save much more compressing that! :)
>
> God bless,
> nathan
> http://www.nathan.co.za
>
>
>