[sword-devel] Another Important Issue

Joe Walker sword-devel@crosswire.org
Tue, 29 Aug 2000 04:22:40 +0100


Hi,

If I understand the problem correctly the problem is that a search for
"the" or "Lord" comes up with lots of hits and storing all those hits
in a fast index file uses a lot of space.

My Java program uses 3 ways to store lists of verses to combat this.
They are all available either in memory or on disk.

I have an interface "Passage" that stores a list of verses and 3
implementations:

DistinctPassage - a simple list of verses.
RangedPassage - stores a list of verse pairs (start and end)
BitwisePassage - stores a 31k long bitmap. One bit per verse

The latter can promise that any result set can be stored in only a
few K, and a fairly simple bit of maths can be used to work out which
is the best algorithm to use.
The latter also makes for very very very fast result set combining
methods. To search for "moses" AND "manna" I simple AND the bitmaps
together.

Shout if you want to know more. All code is GPL.

Joe.



"Troy A. Griffitts" wrote:
> 
> Martin,
>         Thanks for the post.  This is exactly what we are doing with the
> reference implementation of a fast searching framework.  We do one
> search for each word in the text and create an index of every word with
> verse references for each.  We save this index and every time a search
> is performed, we ask the index for the references for the word.  And,
> yes, as you said, we do multiword searches this way also.
> 
> Problems come with large result sets.  You see, not only do we have to
> find verse references for the word[s], we also have to verify that the
> verse references are within the search range specified (valid for the
> key used to specify the search bounds).  This entails iterating through
> the search results and asking the key if each one is valid.  For
> extremely large result sets, this takes just as long as searching the
> entire text, actually sometimes longer than the default searching
> mechanism.
> 
> Any suggestions on how to speed up this process would be greatly
> appreciated.
> 
>         -Troy.
> 
> Martin Gruner wrote:
> >
> > Another feature request:
> >
> > At the moment you can use sword to retrieve text (a list of words) by a key
> > (bible reference).
> > Is it possible to retrieve keys (a list of) by a word? I am not talking about
> > searching. I am talking about something like a concordance. This would
> > involve creating a file for every module that contains information about the
> > location of every single word in the module.
> > For example, if I look up "mesch", sword tells me that this word is not in
> > the module, but the words "mescha", "meschar", "meschelemja" ....
> > If I look up "meschelemja", sword will give me 3 references to where this
> > word occures in the bible.
> > Once this would be implemented, searches for a single word would be speeded
> > up amazingly, because sword would just look them up in the concordance. You
> > could even perform multi word searches using this mechanism.
> > I do not know how realistic this is, but it is at least another (discussable)
> > idea.
> >
> > Martin