[sword-devel] Comming soon: new improved sword searching

Joel Mawhorter sword-devel@crosswire.org
Sat, 14 Sep 2002 15:25:32 -0700


Hi David,

Thanks for the response. My response is inline. 

On September 14, 2002 00:49, David White wrote:
> Joel,
>
> It's great that you want to make Sword searching faster. You might like
> to check out the simple but powerful search scheme I had on my Bible
> program (http://www.whitevine.com/biblereader/), as I think it'd be
> useful in Sword. Essentially, I just had one type of search, and one
> could use AND, OR, and NOT to perform boolean operations. One could also
> use a regular expression anyplace they wanted, without specifying they
> were doing a regular expression search. The program would simply look
> for regex meta-characters, and if it found them, set it to be a regular
> expression search (and you could combine regular expressions, or regular
> expressions and non regular expressions using boolean operations). 

Did you use an index for regular expression searching or did you just match 
the regex against blocks of text? This is an area that I'm still not sure how 
to best handle.

> I didn't have the "search within 3 verses" thing, and I'm not sure how
> useful that is; all I can say is try it and see.

This is potentially very useful (and I can say that fairly since I didn't come 
up with the idea :-). Often in the Bible, sentences span multiple verses. 
Therefore, if you remember a few words from a sentence in the Bible you may 
not find it if you restrict your search to individual verses. Also people may 
want to search for some passage that they vaguely remember and may need to 
extend the search to groups of verses.

> I would tend to recommend to use AND, OR, and NOT instead of &, |, and
> !; since AND, OR, and NOT are likely to be recognizable to people who
> have learned how to use any pre-Google search engine effectively, while
> &, |, and ! are likely to be recognizable only to people who know C.

This is a good point but let me explain why I think the using AND, OR and NOT 
isn't the best way:

1. I would like to include all words in a module in the indexes (i.e. not have 
a stop list). Therefore, AND, OR and NOT will all be possible search terms. 
This makes the parsing of searchs terms more difficult if the parser has to 
be aware of the context of words to determine their meaning. Sticking just 
will non-alphabetic search operators makes parsing much nicer.

2. I hope that frontend writers will include appropriate, user friendly ways 
of searching. For example, an idea that was discussed on this list a while 
back was that it would be nice to have a simple search mode and an advanced 
search mode.  The advanced mode could just take search expression strings and 
pass them to Sword. For an advanced mode, I don't think using &,| and ! are 
problems. A simple search mode could be done by the user filling out a form 
(or some other simple way) and would not have to expose the user to the 
underlying syntax used by Sword.

3. Sword will hopefully be used by many people who don't speak English. To 
many of those people, AND, OR and NOT probably have no special meaning and 
might be less easy to remeber than a single character.

If front end writers really disagree with me on this, they will have the 
option of passing in a parse tree instead of a search expression. This parse 
tree can be constructed using any grammer the frontend designers think is 
best and the functionality provided will be the same as if they passed in a 
search expression.

> Also, the reason Sword is slower than some other programs at searching
> isn't because it doesn't use an index, but because it strips away rich
> markup before it starts the search process; an expensive operation. The
> way toward faster searching in my view, is to create a module format
> which doesn't require this. A module which stores the rich markup in a
> separate data structure, that has indexes into the text as to where it
> is placed. Then you could simply iterate over the text when you search;
> any decent computer can iterate over 4 megs of data in no time at all.
> Even a 486 can execute such an operation lightning fast.

This is good to know. However, I don't think for searching it is necessary to 
store markup seperate from the text. Instead I would like to ensure that the 
index has enough information that searching will not have to be done on the 
text itself, thus removing the need for striping the text. If the only disk 
I/O involved is reading from specific spots in the index, searching should be 
several times faster even than reading an entire markup-free text from disk 
and searching that.

> That said, my Bible program did use indexing, in combination with the
> approach I just mentioned. It would store the location of all words that
> occurred less than a certain amount of times (1000 I believe). Other
> words would have acknowledgement of their existence stored, but not the
> actual verses - it's not worth storing the 40,000 verses that the word
> "the" appears.

I'm don't think I agree on that point. The cost in disk space to store very 
common words isn't really that high and people may want to search for them.

> When someone did a search, the indexing system would be
> called to reduce the possible verses to a series of ranges. It could
> even help with regular expressions - I observed that the most common use
> of regexps is to allow wild cards at the end of a word. e.g. eagle.* -
> you thus know that all matching verses must have "eagle" or "eagles" in
> them, so the indexing system could work out that the verse had to
> contain a word with the prefix "eagle", and would find all words like
> that, concatenating the list of possible verses. Naturally for some
> regexs, the indexing system couldn't possibly help, but for most
> real-world ones it could.

This is a good idea. Perhaps once I have the framework in place for indexing 
and searching you could extend it with something like what you describe. 
Unfortunately any time the index isn't sufficient to completely do the search 
you will have to pay for stripping the text. Perhaps you could talk to Troy 
and Chris about ways to either store the markup seperately (as you suggested) 
or optimize the process of stripping the markup. I think the lack of regex 
support is perhaps the most significant shortcoming of what I have proposed 
so far. Of course, the current regex support will still be there but if it 
can be speed up that would be great.

> I'd be interested in helping out with this, if you want any help; I'm
> trying to get involved with Sword development myself. Let me know if you
> want any assistance.

I would love to have help. Once I get a bit further in the implementation, 
I'll have a better idea of what help I need with. Probably I could use help 
in optimizing the searching code I create. I'll keep you posted on what I'm 
doing.

Thanks,

Joel

> Blessings,
>
> -David.
>
> On Sun, 2002-09-08 at 15:42, Joel Mawhorter wrote:
> > With 95% less time and 7 essential nutrients!
> >
> > Hi all,
> >
> > Most of you don't know me but I've been hanging out in this list for a
> > few years. I've been working on a Bible search program that I started in
> > my last year of University as a guided project. My focus with this Bible
> > program was to implement full featured searching for non-Latin based
> > languages. What I want to see is people all over the world able to study
> > the Bible in their own language. Several times in the past I have
> > evaluated Sword and considered just putting my effort into that but the
> > support for non-Latin languages just wasn't there. However, it now seems
> > to be getting much closer and I think Sword will be more useful than what
> > I could produce on my own.  Therefore, I've decided to join the Sword
> > development project. My first priority is to make a few improvements to
> > the searching mechanism in Sword. I am writing to the list to get
> > feedback while I am still in the planning and early implementation stages
> > of my work.
> >
> > The first area that I will be working on is adding a new type of search
> > to Sword. The new search type will be based on typical boolean search
> > operations (AND, OR, NOT,and maybe XOR using the operators &, |, !, and ^
> > respectively). Grouping with parenthases will be supported. For example,
> > (God & (Father | Son | Spirit)) will give you all of the verses that have
> > the word "God" and at least one of the words "Father", "Son" and
> > "Spirit". Both word and phrase search terms will be supported within the
> > same search expression. For example, (Jesus & "son of God") will find all
> > verses with both the word and the phrase in them. I will also be adding a
> > specialized AND operator that considers verse proximity. For example,
> > ("lamb of God", Jesus, "take away", sins @3) will find all combinations
> > of verses within 3 of each other that have all the search terms in them.
> > This could be one verse that has all the search terms or any set of n
> > verses (where n <= the number of search terms), each with one or more of
> > the search terms, such that the two verses in the set that are fartest
> > apart do not have more than two verses in between. I will also allow
> > simple wildcards. I'm not sure how simple or complex that will be yet but
> > at a minimum will allow something like (Jesus & lov*) which will find
> > love, loving, etc. All of the above functions will be useable within one
> > search expression. For example:
> > ((one*,"a phrase",two@2) ^ (three & !(four | five)). I'm not certain
> > anyone would ever need a search expression of that complexity but it just
> > gives an example of what will be possible. I intend this search
> > functionality to be practical superset of the existing search types. It
> > won't be exactly a superset since it won't have full regular expression
> > support. However, I think that with the functionality available, regular
> > expressions won't be necessary. If any of you can think of an example of
> > something that you do with the current regular expression searching that
> > won't be possible with what I described above, please let me know.
> >
> > The second area that I will be working on is adding indexed searching
> > where searching can be done on a precomputed index of search terms rather
> > than the current mechanism where the whole Bible has to be read in from
> > disk and searched in a brute force manner. This should decrease the
> > search time to a very small fraction of what it currently is. One
> > downside of indexed searching is that full regular expression searching
> > isn't very feasible. I'll leave it as an exercise for the reader to
> > verify that searching for /a.*b/ would be neither be very easy to
> > implement nor very fast using an index (grin).
> >
> > I would really appreciate all of the feedback I can get on this since I
> > would like the searching capabilities of Sword to as strong as is
> > reasonably possible. If you see any problems with what I am suggesting or
> > if you have suggestions for other improvements to searching please send
> > them to the list.
> >
> > In Christ,
> >
> > Joel Mawhorter