[sword-devel] STEP modules
Troy A. Griffitts
Thu, 02 Nov 2000 21:25:52 -0700
We've had some more interest in the STEP area, and I was wondering if
you might be able to give us a rundown of what you were able to
accomplish? If you actually understand the STEP spec well enough to
decompress a module, I'd love for us to work together to add a low level
driver to the framework for this. I realize the RTF+STEPtoHTML filter
will be a much larger effort, but let's at least capture what you got to
work in code-- if you're willing.
"Trandahl, Steve" wrote:
> I don't know too much about searching and compression algorithms, but here's
> a suggestion... SWORD already has a compression/decompression module that
> was added to support STEP format Bibles. As I understand it, the algorithm
> is public domain. I haven't tested the compression portion of the code, but
> decompression works.
> -----Original Message-----
> From: Drew Haninger [mailto:drew@OliveTree.com]
> Sent: Monday, September 18, 2000 10:18 AM
> To: firstname.lastname@example.org
> Subject: Re: [sword-devel] Some other fast search index options
> With Palm Pilot's with more memory available, it would seem not worth the
> effort, but with cell phones on the horizon as a new platform, the effort
> may be worth it. We never finished doing a better compression for the
> smaller Palm Pilot's.
> Drew Haninger
> ----- Original Message -----
> From: "Nathan" <email@example.com>
> To: <firstname.lastname@example.org>
> Sent: Sunday, September 17, 2000 6:05 AM
> Subject: RE: [sword-devel] Some other fast search index options
> > Good day Jerry Hastings and others
> > On 3rd September, Jerry Hastings wrote:
> > > Here are three more methods to try. Consider the bit map
> > > divided into records, of say 32, 16, or 8 bits each.
> > I did the tests as you recommended. The size of the index file
> > comes down from 894 Kb to about 814 Kb.
> > I also tried using Huffman compression, LZW, LZSS, plain RLE,
> > or a combination of all the above. I got it to about 600K.
> > The problem is that we are still compressing the BITMAP files.
> > The uncompressed bitmap files are 48.8 Mbyte. Compressing this
> > down to 600K is impressive, but we can do better by rather
> > trying to compress the verse-lists (list of verses in which the
> > word occurs). The uncompressed size of these lists is 1.23 Mb.
> > (Option 1)
> > Reading up on the subject in the latest research papers,
> > they all seem to indicate that the best compression of the
> > index file is to use these verse-lists (option 1, also
> > called 'inverted lists'), and compress these lists.
> > The steps are:
> > 1. Get the list of verses where the word occurs.
> > This can be an array of 16-bit words (for the Bible)
> > 2. Get the differences between the verses (deltas), e.g.
> > if the verses are 3, 10, 15, 21, ...
> > the delta list would be 3, 7, 5, 6, ...
> > These deltas can be fairly much the same throughout the list.
> > 3. Compress this delta-list with something like
> > arithmetic coding (seems to work best).
> > To use this index, the following will have to be done:
> > 1. Evaluate the search query
> > 2. Get the words in the word-list.
> > 3. Get their corresponding compressed delta-lists.
> > 4. Uncompress.
> > 5. Convert delta-list to verse-list
> > 6. Convert to bitmap if you need to do NOT, OR, AND, or
> > wildcards
> > Question: Is it really worth all that effort to save 400K
> > on the size of the index? A few years ago I would have done
> > it, but with the smallest hard disks for PC's being 8 Gb
> > nowadays, is it really a big deal to have an index of 900K?
> > If you really want to compress, then why keep the text of the
> > Bible uncompressed? You can save much more compressing that! :)
> > God bless,
> > nathan
> > http://www.nathan.co.za