[sword-devel] Announce: Sword/PDA for the Agenda PDA

David Burry sword-devel@crosswire.org
Tue, 23 Oct 2001 16:48:47 -0700

To learn what an "inverted index" is, I highly recommend the book "Managing Gigabytes" by Witten, Moffat, and Bell.  They explain it way better than I could.

I don't think you're describing an "inverted index" you're describing a unique word list, with keys of the verse ids found.  I'd have to refer back to that book to remember the technical term for your index but it's there... mine is bigger because there is one key for each actual word instance (not just for each unique word like yours).  So I have more keys by a few orders of magnitude.  The exact format of my index is complex and spans multiple databases for the whole thing, I'll write it out and post it on the web sometime...  Let me just say for now that it's designed in a somewhat space-wasteful way because this way one single (!!) B+Tree lookup call can tell if a certain unique word exists in the same book, chapter, verse, or within n books, chapters, verses, or words, of any other certain word instance in the whole version.  In fact what I'm doing is not covered in that book and is laughed at by most people as quite a waste of space but it sure does shine in speed and versatility like no other.  I've been looking for some more efficient hashing algorithms for the B+Tree, it has the potential of shrinking my index down to just a couple megs...


At 01:02 PM 10/23/2001 -0500, Jesse Jacobsen wrote:
>On 10/23/01, David Burry wrote:
>> In fact I didn't even compress the inverted index at all, used
>> Berkley DB (B+tree) instead which is quite wasteful of space (makes
>> it take about 25 megs per version) but it sure is
>> ***blazing***fast*** as a result!!!  
>I've got what may be an "inverted index" for the same purpose, using
>the Berkeley B+tree database as well.  Maybe I just don't understand
>what an inverted index is, but mine ends up be about 4.3mb per version
>for the database.  Keys are the words themselves (no punctuation
>except what may exist within a word), in lower case.  Values are
>packed integer arrays, each integer being the verse index where the
>word is found.  If you're recording more than that, I'd like to know
>what your database looks like.
>Keep in mind I've only taken a handful of programming courses, so I
>might be a little hazy on some terms.
>If we confess our sins, He is faithful and just to forgive us our sins
>and to cleanse us from all unrighteousness.  If we say that we have
>not sinned, we make Him a liar, and His word is not in us.
>                            http://www.grace-els.org