[sword-devel] Coming soon: new improved sword searching

David White sword-devel@crosswire.org
14 Sep 2002 17:49:13 +1000


Joel,

It's great that you want to make Sword searching faster. You might like
to check out the simple but powerful search scheme I had on my Bible
program (http://www.whitevine.com/biblereader/), as I think it'd be
useful in Sword. Essentially, I just had one type of search, and one
could use AND, OR, and NOT to perform boolean operations. One could also
use a regular expression anyplace they wanted, without specifying they
were doing a regular expression search. The program would simply look
for regex meta-characters, and if it found them, set it to be a regular
expression search (and you could combine regular expressions, or regular
expressions and non regular expressions using boolean operations). I
didn't have the "search within 3 verses" thing, and I'm not sure how
useful that is; all I can say is try it and see.
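A minimal sketch of the scheme described above, in Python rather than the
original program's code. The list of meta-characters, the helper names, and
the operator precedence (NOT binds tighter than AND, which binds tighter
than OR) are all assumptions made for illustration, and parentheses are
left out to keep it short.

```python
import re

# Characters treated as regex meta-characters. This list is an assumption;
# the original program's exact list is not given in this message.
REGEX_META = set(".*+?[](){}|^$\\")

def is_regex(term):
    """Return True if the term contains any regex meta-character."""
    return any(ch in REGEX_META for ch in term)

def term_matcher(term):
    """Build a predicate that tests one verse against one search term."""
    if is_regex(term):
        pattern = re.compile(term, re.IGNORECASE)
    else:
        # Plain word: match it as a whole word, case-insensitively.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return lambda verse: pattern.search(verse) is not None

def both(a, b):
    return lambda verse: a(verse) and b(verse)

def either(a, b):
    return lambda verse: a(verse) or b(verse)

def negate(a):
    return lambda verse: not a(verse)

def parse_query(query):
    """Turn a query such as 'eagle.* AND NOT wings' into a verse predicate."""
    tokens = query.split()
    pos = 0

    def parse_or():
        nonlocal pos
        left = parse_and()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            left = either(left, parse_and())
        return left

    def parse_and():
        nonlocal pos
        left = parse_not()
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            left = both(left, parse_not())
        return left

    def parse_not():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("expected a search term")
        if tokens[pos] == "NOT":
            pos += 1
            return negate(parse_not())
        term = tokens[pos]
        pos += 1
        return term_matcher(term)

    predicate = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return predicate
```

The user never says which kind of search they want: each term is inspected
on its own, so a plain word and a regular expression can sit on either side
of the same AND, OR, or NOT.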

I would recommend using AND, OR, and NOT instead of &, |, and !, since
AND, OR, and NOT are likely to be recognizable to people who have
learned how to use any pre-Google search engine effectively, while &, |,
and ! are likely to be recognizable only to people who know C.

Also, the reason Sword is slower than some other programs at searching
isn't that it doesn't use an index, but that it strips away rich markup
before it starts the search process, which is an expensive operation.
The way toward faster searching, in my view, is to create a module
format which doesn't require this: a module which stores the rich markup
in a separate data structure, with indexes into the text recording where
each piece of markup belongs. Then you could simply iterate over the
text when you search; any decent computer can iterate over 4 megs of
data in no time at all. Even a 486 can execute such an operation
lightning fast.
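To illustrate the idea, here is a minimal Python sketch of a verse stored
as plain text plus a separate list of markup spans. It is not an existing
Sword module format; the class names, the tag name, and the assumption that
spans never overlap are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MarkupSpan:
    start: int  # offset into the plain text where the markup begins
    end: int    # offset where it ends (exclusive)
    tag: str    # hypothetical tag name, e.g. "divineName"

@dataclass
class Verse:
    text: str     # plain text, stored with no markup in it
    markup: list  # MarkupSpan entries pointing into text

def search(verses, needle):
    """Scan the plain text directly; no markup has to be stripped first."""
    needle = needle.lower()
    return [ref for ref, verse in verses.items()
            if needle in verse.text.lower()]

def render(verse):
    """Re-apply the stored markup when a verse is displayed.

    Assumes the spans do not overlap.
    """
    out = []
    pos = 0
    for span in sorted(verse.markup, key=lambda s: s.start):
        out.append(verse.text[pos:span.start])
        inner = verse.text[span.start:span.end]
        out.append(f"<{span.tag}>{inner}</{span.tag}>")
        pos = span.end
    out.append(verse.text[pos:])
    return "".join(out)
```

The cost of handling markup moves from search time, where it is paid for
every verse, to display time, where it is paid only for the verses that
are actually shown.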

That said, my Bible program did use indexing, in combination with the
approach I just mentioned. It would store the locations of all words that
occurred fewer than a certain number of times (1000, I believe). Other
words would have their existence recorded, but not the actual verses -
it's not worth storing the 40,000 verses in which the word "the"
appears. When someone did a search, the indexing system would be
called to reduce the possible verses to a series of ranges. It could
even help with regular expressions - I observed that the most common use
of regexps is to allow wild cards at the end of a word. e.g. eagle.* -
you thus know that all matching verses must have "eagle" or "eagles" in
them, so the indexing system could work out that the verse had to
contain a word with the prefix "eagle", and would find all words like
that, concatenating the list of possible verses. Naturally, for some
regexes, the indexing system couldn't possibly help, but for most
real-world ones it could.
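A rough Python sketch of this kind of index follows. It makes several
assumptions that are not from the original program: the threshold counts
verses rather than occurrences, patterns are matched from the start of a
word, the function names are invented, and the result is a set of verse
numbers rather than a series of ranges.

```python
import re
from collections import defaultdict

MAX_POSTINGS = 1000  # threshold as recalled above; the exact value is unsure

def build_index(verses, max_postings=MAX_POSTINGS):
    """Map each word to the set of verse numbers that contain it.

    Words found in max_postings or more verses map to None: the index
    records that they exist, but not where.
    """
    postings = defaultdict(set)
    for number, text in enumerate(verses):
        for word in re.findall(r"[a-z]+", text.lower()):
            postings[word].add(number)
    return {word: (found if len(found) < max_postings else None)
            for word, found in postings.items()}

def literal_prefix(pattern):
    """Return the letters every match must start with, or '' if unknown."""
    if "|" in pattern:
        return ""  # alternation: no single prefix is guaranteed
    m = re.match(r"[A-Za-z]+", pattern)
    if not m:
        return ""
    prefix = m.group(0)
    # A quantifier that allows zero repeats makes the last letter optional.
    if pattern[m.end():m.end() + 1] in ("*", "?", "{"):
        prefix = prefix[:-1]
    return prefix.lower()

def candidate_verses(index, pattern, total):
    """Return verse numbers that might match; all of them if no help."""
    everything = set(range(total))
    prefix = literal_prefix(pattern)
    if not prefix:
        return everything
    candidates = set()
    for word, found in index.items():
        if word.startswith(prefix):
            if found is None:  # too common: locations were not stored
                return everything
            candidates |= found
    return candidates

def search(verses, index, pattern):
    """Run the real regex, but only over the verses the index allows."""
    regex = re.compile(r"\b(?:" + pattern + ")", re.IGNORECASE)
    possible = candidate_verses(index, pattern, len(verses))
    return [n for n in sorted(possible) if regex.search(verses[n])]
```

The index only narrows the set of verses; the regular expression is still
run on each candidate, so a pattern the index cannot help with falls back
to a full scan instead of giving wrong results.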

I'd be interested in helping out with this, if you want any help; I'm
trying to get involved with Sword development myself. Let me know if you
want any assistance.

Blessings,

-David.

On Sun, 2002-09-08 at 15:42, Joel Mawhorter wrote:
> With 95% less time and 7 essential nutrients!
> 
> Hi all,
> 
> Most of you don't know me but I've been hanging out in this list for a few 
> years. I've been working on a Bible search program that I started in my last 
> year of University as a guided project. My focus with this Bible program was 
> to implement full featured searching for non-Latin based languages. What I 
> want to see is people all over the world able to study the Bible in their own 
> language. Several times in the past I have evaluated Sword and considered 
> just putting my effort into that but the support for non-Latin languages just 
> wasn't there. However, it now seems to be getting much closer and I think 
> Sword will be more useful than what I could produce on my own.  Therefore, 
> I've decided to join the Sword development project. My first priority is to 
> make a few improvements to the searching mechanism in Sword. I am writing to 
> the list to get feedback while I am still in the planning and early 
> implementation stages of my work.
> 
> The first area that I will be working on is adding a new type of search to 
> Sword. The new search type will be based on typical boolean search operations 
> (AND, OR, NOT, and maybe XOR using the operators &, |, !, and ^ respectively). 
> Grouping with parentheses will be supported. For example, (God & (Father | 
> Son | Spirit)) will give you all of the verses that have the word "God" and 
> at least one of the words "Father", "Son" and "Spirit". Both word and phrase 
> search terms will be supported within the same search expression. For 
> example, (Jesus & "son of God") will find all verses with both the word and 
> the phrase in them. I will also be adding a specialized AND operator that 
> considers verse proximity. For example, ("lamb of God", Jesus, "take away", 
> sins @3) will find all combinations of verses within 3 of each other that 
> have all the search terms in them. This could be one verse that has all the 
> search terms or any set of n verses (where n <= the number of search terms), 
> each with one or more of the search terms, such that the two verses in the 
> set that are farthest apart do not have more than two verses in between. I 
> will also allow simple wildcards. I'm not sure how simple or complex that 
> will be yet but at a minimum will allow something like (Jesus & lov*) which 
> will find love, loving, etc. All of the above functions will be useable 
> within one search expression. For example: 
> ((one*,"a phrase",two@2) ^ (three & !(four | five))). I'm not certain anyone 
> would ever need a search expression of that complexity but it just gives an 
> example of what will be possible. I intend this search functionality to be a 
> practical superset of the existing search types. It won't be exactly a 
> superset since it won't have full regular expression support. However, I 
> think that with the functionality available, regular expressions won't be 
> necessary. If any of you can think of an example of something that you do 
> with the current regular expression searching that won't be possible with 
> what I described above, please let me know.
> 
> The second area that I will be working on is adding indexed searching where 
> searching can be done on a precomputed index of search terms rather than the 
> current mechanism where the whole Bible has to be read in from disk and 
> searched in a brute force manner. This should decrease the search time to a 
> very small fraction of what it currently is. One downside of indexed 
> searching is that full regular expression searching isn't very feasible. I'll 
> leave it as an exercise for the reader to verify that searching for /a.*b/ 
> would be neither very easy to implement nor very fast using an index 
> (grin).
> 
> I would really appreciate all of the feedback I can get on this since I would 
> like the searching capabilities of Sword to be as strong as is reasonably 
> possible. If you see any problems with what I am suggesting or if you have 
> suggestions for other improvements to searching please send them to the list.
> 
> In Christ,
> 
> Joel Mawhorter
> 
>