[sword-devel] Another Important Issue

Tue, 29 Aug 2000 08:48:03 -1000

What I did with SwordSearcher is similar to your bitmap. A search creates an
array of 31102 bytes. Each key is looked up in the pre-index, and in each
verse it appears in gets its value incremented in the array.

For a simple AND search, you just need to make sure that each byte has the
correct value. An OR or NOT search is just as easy. If a phrase search or
case sensitive search is done, each verse needs to be individually verified,
but you've already removed every verse that CANT be correct, so it still
takes less time.

I have never seen any realistic searches done that are actually slower with
this method. In SwordSearcher at least, on all of my machines, searches
appear almost instantaneously.

Joe's individual bitmaps for each key is a better idea if you are going to
implement wildcard search abilities.

-Brandon

----- Original Message -----
From: "Joe Walker" <joe@eireneh.com>
To: <sword-devel@crosswire.org>
Sent: Monday, August 28, 2000 5:22 PM
Subject: Re: [sword-devel] Another Important Issue

>
> Hi,
>
> If I understand the problem correctly the problem is that a search for
> "the" or "Lord" comes up with lots of hits and storing all those hits
> in a fast index file uses a lot of space.
>
> My Java program uses 3 ways to store lists of verses to combat this.
> They are all available either in memory or on disk.
>
> I have an interface "Passage" that stores a list of verses and 3
> implementations:
>
> DistinctPassage - a simple list of verses.
> RangedPassage - stores a list of verse pairs (start and end)
> BitwisePassage - stores a 31k long bitmap. One bit per verse
>
> The latter can promise that any result set can be stored in only a
> few K, and a fairly simple bit of maths can be used to work out which
> is the best algorithm to use.
> The latter also makes for very very very fast result set combining
> methods. To search for "moses" AND "manna" I simple AND the bitmaps
> together.
>
> Shout if you want to know more. All code is GPL.
>
> Joe.
>
>
>
> "Troy A. Griffitts" wrote:
> >
> > Martin,
> >         Thanks for the post.  This is exactly what we are doing with the
> > reference implementation of a fast searching framework.  We do one
> > search for each word in the text and create an index of every word with
> > verse references for each.  We save this index and every time a search
> > is performed, we ask the index for the references for the word.  And,
> > yes, as you said, we do multiword searches this way also.
> >
> > Problems come with large result sets.  You see, not only do we have to
> > find verse references for the word[s], we also have to verify that the
> > verse references are within the search range specified (valid for the
> > key used to specify the search bounds).  This entails iterating through
> > the search results and asking the key if each one is valid.  For
> > extremely large result sets, this takes just as long as searching the
> > entire text, actually sometimes longer than the default searching
> > mechanism.
> >
> > Any suggestions on how to speed up this process would be greatly
> > appreciated.
> >
> >         -Troy.
> >
> > Martin Gruner wrote:
> > >
> > > Another feature request:
> > >
> > > At the moment you can use sword to retrieve text (a list of words) by
a key
> > > (bible reference).
> > > Is it possible to retrieve keys (a list of) by a word? I am not
talking about
> > > searching. I am talking about something like a concordance. This would
> > > involve creating a file for every module that contains information
about the
> > > location of every single word in the module.
> > > For example, if I look up "mesch", sword tells me that this word is
not in
> > > the module, but the words "mescha", "meschar", "meschelemja" ....
> > > If I look up "meschelemja", sword will give me 3 references to where
this
> > > word occures in the bible.
> > > Once this would be implemented, searches for a single word would be
speeded
> > > up amazingly, because sword would just look them up in the
concordance. You
> > > could even perform multi word searches using this mechanism.
> > > I do not know how realistic this is, but it is at least another
(discussable)
> > > idea.
> > >
> > > Martin