[sword-devel] Some other fast search index options

Trandahl, Steve sword-devel@crosswire.org
Mon, 18 Sep 2000 08:24:00 -0700


I don't know too much about searching and compression algorithms, but here's
a suggestion...  SWORD already has a compression/decompression module that
was added to support STEP format Bibles.  As I understand it, the algorithm
is public domain.  I haven't tested the compression portion of the code, but
decompression works.

Steve

-----Original Message-----
From: Drew Haninger [mailto:drew@OliveTree.com]
Sent: Monday, September 18, 2000 10:18 AM
To: sword-devel@crosswire.org
Subject: Re: [sword-devel] Some other fast search index options


With Palm Pilot's with more memory available, it would seem not worth the
effort, but with cell phones on the horizon as a new platform, the effort
may be worth it.  We never finished doing a better compression for the
smaller Palm Pilot's.

Drew Haninger

----- Original Message -----
From: "Nathan" <mail@nathan.co.za>
To: <sword-devel@crosswire.org>
Sent: Sunday, September 17, 2000 6:05 AM
Subject: RE: [sword-devel] Some other fast search index options


> Good day Jerry Hastings and others
>
> On 3rd September, Jerry Hastings wrote:
> > Here are three more methods to try.  Consider the bit map
> > divided into records, of say 32, 16, or 8 bits each.
>
> I did the tests as you recommended. The size of the index file
> comes down from 894 Kb to about 814 Kb.
> I also tried using Huffman compression, LZW, LZSS, plain RLE,
> or a combination of all the above. I got it to about 600K.
>
> The problem is that we are still compressing the BITMAP files.
> The uncompressed bitmap files are 48.8 Mbyte. Compressing this
> down to 600K is impressive, but we can do better by rather
> trying to compress the verse-lists (list of verses in which the
> word occurs). The uncompressed size of these lists is 1.23 Mb.
> (Option 1)
>
> Reading up on the subject in the latest research papers,
> they all seem to indicate that the best compression  of the
> index file is to use these verse-lists (option 1, also
> called 'inverted lists'), and compress these lists.
>
> The steps are:
> 1. Get the list of verses where the word occurs.
>    This can be an array of 16-bit words (for the Bible)
> 2. Get the differences between the verses (deltas), e.g.
>    if the verses are 3, 10, 15, 21, ...
>    the delta list would be 3, 7, 5, 6, ...
>    These deltas can be fairly much the same throughout the list.
> 3. Compress this delta-list with something like
>    arithmetic coding (seems to work best).
>
> To use this index, the following will have to be done:
> 1. Evaluate the search query
> 2. Get the words in the word-list.
> 3. Get their corresponding compressed delta-lists.
> 4. Uncompress.
> 5. Convert delta-list to verse-list
> 6. Convert to bitmap if you need to do NOT, OR, AND, or
>    wildcards
>
> Question: Is it really worth all that effort to save 400K
> on the size of the index? A few years ago I would have done
> it, but with the smallest hard disks for PC's being 8 Gb
> nowadays, is it really a big deal to have an index of 900K?
>
> If you really want to compress, then why keep the text of the
> Bible uncompressed? You can save much more compressing that! :)
>
> God bless,
> nathan
> http://www.nathan.co.za
>
>
>