[sword-devel] Comming soon: new improved sword searching

Joel Mawhorter sword-devel@crosswire.org
Sun, 08 Sep 2002 09:54:04 -0700


On September 8, 2002 01:01, David Burry wrote:
> Awesome!  Poor search capability and speed is the primary reason I've been
> only lurking on this list too for so long instead of contributing.  I don't
> have the C experience to improve that very much easily, and don't have the
> time to dive in and learn at the moment to invent better searching for
> sword from scratch.  But I have done some work on fast indexing and
> searching algorithms, maybe we can work together... you can see some of my
> tests here:

I would be glad to have your help. Once I get feedback on what functionality 
people would like to have, I send out a description of how I plan to 
implement that. I would greatly appreciate feedback on both of those stages 
and since you've already implemented something like this you probably have 
some good insight into how to do things or how not to do things (if you tried 
anything that didn't work well). 

> http://beaver.dburry.com/cgi-perl/bible
>
> You will see the framework and a lot of implementation of what you're
> talking about is already there (albeit with different syntax and some added
> things), but it's not in C it's in Perl, therefore of little use to sword
> at the moment without porting it.  Complicated searches generally take
> about 200 milliseconds or so (on an old 233 mhz PII).

I took a look over your syntax and features and I think there are some things 
there that might be worth including. Actually I was just talking with my wife 
about Bible searching last night and she suggested some sort of word ordering 
search which I notice you have on your site. I may ask you some questions 
about some of the operators you support and how you implemented them.

> You will notice I did implement full regexp searching on the indexed
> version, by limiting regexps to only search the list of indexed words not
> the full text.  Yes, it's a bit slow (usually 2 seconds) to search the
> entire word list with regexps, but hey still beats the pants off a brute
> force full text regexp!

I have considered this but haven't decided whether or not to do it. It is 
questionable how useful it is to do full regex searches on just the search 
terms. When would you ever use anything except wildcards? Regexps really have 
much more power when searching large blocks of text rather than single words. 
For instance, with the example I gave (/a.*b/), when would you ever really 
want to know all of the verses that had search terms with an "a" followed by 
a "b"? 

> For instance, using your example:
>
> http://beaver.dburry.com/cgi-perl/bible?encoding=utf-8&refstr=kjv&srchstr=%
>2 Fa.*b%2F&maxverses=30
>
> this one is particularly long, and takes about 3.5 seconds to run the
> regexp... and then 1.2 seconds to find all the bazillion verses that match
> (see the speed notes at the bottom, this latter part can be optimized since
> we're only showing the first 30 matches so who cares about locating the
> rest internally)... don't everyone hit this all at once you could run my
> poor home linux machine on a DSL out of memory or something, haha, no I
> don't really think so I have the apache processes limited so you'd have to
> wait your turns instead ;o)
>
> I also have some ideas not yet implemented about creating a "phrase index"
> that indexes word pairs instead of words... it should be able to be used
> almost the same as my normal word index for most things, but makes phrase
> searches faster still and eliminates the loooong failure case for when you
> search for something like the phrase  "the the" (this ties up the machine
> for like 25 secs and then at last you get a failure notice)  The phrase
> index is no bigger than the normal word index, in fact it can be smaller
> because the data is more evenly distributed among the hash buckets
> internally, therefore requiring fewer of them.  However, my particular
> indexing algorithm is probably too large for desktop use, but not bad for a
> web server because it's so faaaast!  It uses a Berkeley DB B+Tree to store
> the main part of the index.  The phrase index will be important to improve
> the speed for languages like Chinese, where you index each character.  I
> noticed that in such languages my algorithm starts bogging down because
> everything is found so often, it can't start with an unpopular word to make
> it quick, and the phrase index can help dramatically there.  It would also
> help in other languages if we ever indexed letter-by-letter instead of
> word-by-word (which is probably useless in most languages, but... you never
> know)

A word pair index would be potentially VERY LARGE since the number of unique 
two word pairs in the Bible is much larger than the number of unique words. I 
think most reasonable phrase searches will work fairly fast even with just a 
word index. I'm not too concerned about pathalogical cases such as "the the". 
If someone really wants to search for that then they can just sit around and 
wait.

Regarding Chinese, I have already developed an indexing method for Chinese 
that is fairly small and fast but doesn't use phrase indexing. I got sub 1 
second times for all of the searches I tried, even complex ones (and that was 
in Java). Actually Chinese was the first language I worked on in my previous 
Bible program since the guided project that I did in university was on 
indexing Chinese text and the proof of concept for that course was the Bible 
program I have been working on. The index I created was ~ 4.5 Mb. If you're 
interested in the indexing method, I can send you a description of it.

> I've designed but not yet implemented a much more complicated index storage
> system that should be smaller (about 5 megs per bible version) and still
> preserve all the extended functionality of my existing one (which is about
> 25 megs per version) without reducing speed any (in fact it may increase
> speed, not sure yet).  This came out of the idea of trying to design an
> order preserving minimal perfect hash function for my index, when it dawned
> on me that my data is so predictable I don't need to use hashes for most
> things just some complicated tables that reference tables that reference
> tables that reference tables, etc.  i.e. that large hash function by itself
> would be as good as the index it was trying to describe, yet much smaller
> because no holes between the hash values.  Again, this better index is not
> yet coded, only on the drawing board.

Is your index 25 Mb on disk or in memory? I'm curious to know what kind of 
speed and functionality advantages this method has.

I've built an index of the KJV using the indexing scheme I am currently 
considering for word based languages (i.e. versus character based, such as 
Chinese) and the index was ~2 Mb. However, swords keysize is 4 bytes and I 
used 3 bytes so for Sword I expect it to be ~2.5 Mb - 3 Mb. This method is 
probably similar to the method you are thinking of implementing. I'll write 
up a short description of the way I did it and post it on my web site.

> I've also done some work on prediction of how long a search will take
> beforehand, so as to help eliminate DOS attacks if this were run on a web
> server (for instance), so that you can refuse to run a search that will
> display too many results and take too much resources.  This is completely
> disabled in my test script above, because I want to test the real thing...
> (the non-existant phrase "the the" is currently the longest thing possible
> in english I think).

The ability to limit search time might be something to consider, especially 
since Sword might be used by servers. However one reason you are seeing 25 
second search times is that you are using Perl so you will see much longer 
search times than you would in C/C++. I expect based on my experience with 
Java that any reasonable search in C/C++ will be < 1 sec on a not very fast 
machine. I would expect searching for "the the" to take < 5 secs. Of course 
it is always possible to generate searches that take an arbitrarily long 
time. For example (the | the | the | the | the | the | the | the ...). 
Possibly a smart searching algorithm could eliminate that possibility but I 
don't think it is worth the amount of work to make the searching algorithms 
that smart. 

Thanks for the comments. Keep them comming!

Joel

> Dave
>
>
> ----- Original Message -----
> From: "Joel Mawhorter" <joel-ml@mawhorter.org>
> To: <sword-devel@crosswire.org>
> Sent: Saturday, September 07, 2002 10:42 PM
> Subject: [sword-devel] Comming soon: new improved sword searching
>
> > With 95% less time and 7 essential nutrients!
> >
> > Hi all,
> >
> > Most of you don't know me but I've been hanging out in this list for a
> > few years. I've been working on a Bible search program that I started in
> > my
>
> last
>
> > year of University as a guided project. My focus with this Bible program
>
> was
>
> > to implement full featured searching for non-Latin based languages. What
> > I want to see is people all over the world able to study the Bible in
> > their
>
> own
>
> > language. Several times in the past I have evaluated Sword and considered
> > just putting my effort into that but the support for non-Latin languages
>
> just
>
> > wasn't there. However, it now seems to be getting much closer and I think
> > Sword will be more useful than what I could produce on my own. 
> > Therefore, I've decided to join the Sword development project. My first
> > priority is
>
> to
>
> > make a few improvements to the searching mechanism in Sword. I am writing
>
> to
>
> > the list to get feedback while I am still in the planning and early
> > implementation stages of my work.
>
> ...