[sword-devel] Comming soon: new improved sword searching

15 Sep 2002 14:39:25 +1000

Joel,

Thank you for your response; comments are below.

On Sun, 2002-09-15 at 08:25, Joel Mawhorter wrote:
> Hi David,
> 
> Thanks for the response. My response is inline. 
> 
> On September 14, 2002 00:49, David White wrote:
> > Joel,
> >
> > It's great that you want to make Sword searching faster. You might like
> > to check out the simple but powerful search scheme I had on my Bible
> > program (http://www.whitevine.com/biblereader/), as I think it'd be
> > useful in Sword. Essentially, I just had one type of search, and one
> > could use AND, OR, and NOT to perform boolean operations. One could also
> > use a regular expression anyplace they wanted, without specifying they
> > were doing a regular expression search. The program would simply look
> > for regex meta-characters, and if it found them, set it to be a regular
> > expression search (and you could combine regular expressions, or regular
> > expressions and non regular expressions using boolean operations). 
> 
> Did you use an index for regular expression searching or did you just match 
> the regex against blocks of text? This is an area that I'm still not sure how 
> to best handle.

As I said below, I used an index for regular expressions where you can
work out that a certain word, or certain set of words, must be present
for the regular expression to match. Otherwise you have to do a full
text search.

> 
> > I didn't have the "search within 3 verses" thing, and I'm not sure how
> > useful that is; all I can say is try it and see.
> 
> This is potentially very useful (and I can say that fairly since I didn't come 
> up with the idea :-). Often in the Bible, sentences span multiple verses. 
> Therefore, if you remember a few words from a sentence in the Bible you may 
> not find it if you restrict your search to individual verses. Also people may 
> want to search for some passage that they vaguely remember and may need to 
> extend the search to groups of verses.

Yes, I can see how this could be useful.

> 
> > I would tend to recommend to use AND, OR, and NOT instead of &, |, and
> > !; since AND, OR, and NOT are likely to be recognizable to people who
> > have learned how to use any pre-Google search engine effectively, while
> > &, |, and ! are likely to be recognizable only to people who know C.
> 
> This is a good point but let me explain why I think the using AND, OR and NOT 
> isn't the best way:

ok, I still think that for an English-language frontend, using AND, OR,
and NOT is best; but I do agree that it is perhaps best to handle this
in the front end. The symbols !, and perhaps & (I don't think |) can
perceivably found in texts, so using them as literals should probably be
handled somehow as well.

[snip for brevity]

> 
> > Also, the reason Sword is slower than some other programs at searching
> > isn't because it doesn't use an index, but because it strips away rich
> > markup before it starts the search process; an expensive operation. The
> > way toward faster searching in my view, is to create a module format
> > which doesn't require this. A module which stores the rich markup in a
> > separate data structure, that has indexes into the text as to where it
> > is placed. Then you could simply iterate over the text when you search;
> > any decent computer can iterate over 4 megs of data in no time at all.
> > Even a 486 can execute such an operation lightning fast.
> 
> This is good to know. However, I don't think for searching it is necessary to 
> store markup seperate from the text. Instead I would like to ensure that the 
> index has enough information that searching will not have to be done on the 
> text itself, thus removing the need for striping the text. If the only disk 
> I/O involved is reading from specific spots in the index, searching should be 
> several times faster even than reading an entire markup-free text from disk 
> and searching that.

Well, remember that you are almost always going to have to load and
strip some verses, to make sure the phrase matches. For instance, if
someone searched for the phrase "Jesus said", then you are going to have
to search all verses which contain both "Jesus", and "said", and see if
the words appear sequentially. If someone searched for "Jesus & said",
naturally you aren't going to have to load anything in at all, but I
think this would be substantially less common.

Using an index is going to speed things up, but you don't *need* an
index to get fast search speeds. Just try using grep :-)

> 
> > That said, my Bible program did use indexing, in combination with the
> > approach I just mentioned. It would store the location of all words that
> > occurred less than a certain amount of times (1000 I believe). Other
> > words would have acknowledgement of their existence stored, but not the
> > actual verses - it's not worth storing the 40,000 verses that the word
> > "the" appears.
> 
> I'm don't think I agree on that point. The cost in disk space to store very 
> common words isn't really that high and people may want to search for them.

I seriously doubt anyone is going to want to search just for a common
word, like "and". If they do, then I don't think it's a big deal that
it'll take a couple of seconds to show them their results.

People are going to sometimes search for common words mixed with
uncommon words; for instance the phrase "the beginning". Now, if you
just store the word "beginning", and not "the", then you will get 104
verses (in the KJV), each of which you will have to load and search to
see if they match the exact phrase. If you also store "the" in your
index, then you will have 24,094 verses which match. You will have to
find the common subset of verses which contain "beginning" and "the",
and end up loading, stripping, and searching the 101 verses that contain
both. The cost of loading 24,094 verse references, and doing a search
through these references to find the common subset with 104 verse
references is going to far outstrip the cost of reading in 3 extra
verses and searching them. In this case, having the common words indexed
will end up being a pessimization instead of an optimization.

As for space, in my program, I stored the index as a list of words with
references to verses those words appeared in - a byte each for book,
chapter, and verse. So 3 bytes per reference, in addition to the cost of
the words themselves and a little meta-data. I don't think you'd be able
to get it much more compressed than that. Anyhow, if I stored the
locations of all words, the index file ballooned out to over 2 megabytes
(for the KJV, or any typical Bible). When I stored only words that
appeared less than 2000 times, it was just over 1 megabyte. That's going
to be a substantial difference on space-limited devices.

Also, remember that the file being smaller will give faster access, even
if you are only reading the same amount of data. Less levels of
indirection have to be used on an inode-based filesystem, and on FAT the
allocation table is smaller and faster to traverse; also there is a
greater chance of cache hits with subsequent access to different parts
of the file.

So anyhow, that's my case for not storing too-common words. I think you
should at least consider making it optional, so that the module creator
can decide if they want to index all words, or set a threshold, above
which existence-only is recorded.

> 
> > When someone did a search, the indexing system would be
> > called to reduce the possible verses to a series of ranges. It could
> > even help with regular expressions - I observed that the most common use
> > of regexps is to allow wild cards at the end of a word. e.g. eagle.* -
> > you thus know that all matching verses must have "eagle" or "eagles" in
> > them, so the indexing system could work out that the verse had to
> > contain a word with the prefix "eagle", and would find all words like
> > that, concatenating the list of possible verses. Naturally for some
> > regexs, the indexing system couldn't possibly help, but for most
> > real-world ones it could.
> 
> This is a good idea. Perhaps once I have the framework in place for indexing 
> and searching you could extend it with something like what you describe. 
> Unfortunately any time the index isn't sufficient to completely do the search 
> you will have to pay for stripping the text. Perhaps you could talk to Troy 
> and Chris about ways to either store the markup seperately (as you suggested) 
> or optimize the process of stripping the markup. I think the lack of regex 
> support is perhaps the most significant shortcoming of what I have proposed 
> so far. Of course, the current regex support will still be there but if it 
> can be speed up that would be great.

Well I don't see how the separation of markup from the text has any
effect on this suggestion, since you will still be having to load in
potentially matched verses to confirm that they match anyhow. You would
use the same sort of mechanism.

However, I am still convinced that separating markup and text is the
best solution. I am unsure how easy or hard it would be to create a
module type which does this though, so it may or may not be feasible.
Also, how are you planning to handle searching of markup itself? Using
separation, it would be fairly easy: just iterate over the markup too,
and search any markup that is flagged as being "searchable". Did you
have plans to handle searching of markup in your system?

> 
> > I'd be interested in helping out with this, if you want any help; I'm
> > trying to get involved with Sword development myself. Let me know if you
> > want any assistance.
> 
> I would love to have help. Once I get a bit further in the implementation, 
> I'll have a better idea of what help I need with. Probably I could use help 
> in optimizing the searching code I create. I'll keep you posted on what I'm 
> doing.

ok, great; I look forward to hearing about your progress on this.

-David.

> 
> Thanks,
> 
> Joel
> 
> > Blessings,
> >
> > -David.
> >
> > On Sun, 2002-09-08 at 15:42, Joel Mawhorter wrote:
> > > With 95% less time and 7 essential nutrients!
> > >
> > > Hi all,
> > >
> > > Most of you don't know me but I've been hanging out in this list for a
> > > few years. I've been working on a Bible search program that I started in
> > > my last year of University as a guided project. My focus with this Bible
> > > program was to implement full featured searching for non-Latin based
> > > languages. What I want to see is people all over the world able to study
> > > the Bible in their own language. Several times in the past I have
> > > evaluated Sword and considered just putting my effort into that but the
> > > support for non-Latin languages just wasn't there. However, it now seems
> > > to be getting much closer and I think Sword will be more useful than what
> > > I could produce on my own.  Therefore, I've decided to join the Sword
> > > development project. My first priority is to make a few improvements to
> > > the searching mechanism in Sword. I am writing to the list to get
> > > feedback while I am still in the planning and early implementation stages
> > > of my work.
> > >
> > > The first area that I will be working on is adding a new type of search
> > > to Sword. The new search type will be based on typical boolean search
> > > operations (AND, OR, NOT,and maybe XOR using the operators &, |, !, and ^
> > > respectively). Grouping with parenthases will be supported. For example,
> > > (God & (Father | Son | Spirit)) will give you all of the verses that have
> > > the word "God" and at least one of the words "Father", "Son" and
> > > "Spirit". Both word and phrase search terms will be supported within the
> > > same search expression. For example, (Jesus & "son of God") will find all
> > > verses with both the word and the phrase in them. I will also be adding a
> > > specialized AND operator that considers verse proximity. For example,
> > > ("lamb of God", Jesus, "take away", sins @3) will find all combinations
> > > of verses within 3 of each other that have all the search terms in them.
> > > This could be one verse that has all the search terms or any set of n
> > > verses (where n <= the number of search terms), each with one or more of
> > > the search terms, such that the two verses in the set that are fartest
> > > apart do not have more than two verses in between. I will also allow
> > > simple wildcards. I'm not sure how simple or complex that will be yet but
> > > at a minimum will allow something like (Jesus & lov*) which will find
> > > love, loving, etc. All of the above functions will be useable within one
> > > search expression. For example:
> > > ((one*,"a phrase",two@2) ^ (three & !(four | five)). I'm not certain
> > > anyone would ever need a search expression of that complexity but it just
> > > gives an example of what will be possible. I intend this search
> > > functionality to be practical superset of the existing search types. It
> > > won't be exactly a superset since it won't have full regular expression
> > > support. However, I think that with the functionality available, regular
> > > expressions won't be necessary. If any of you can think of an example of
> > > something that you do with the current regular expression searching that
> > > won't be possible with what I described above, please let me know.
> > >
> > > The second area that I will be working on is adding indexed searching
> > > where searching can be done on a precomputed index of search terms rather
> > > than the current mechanism where the whole Bible has to be read in from
> > > disk and searched in a brute force manner. This should decrease the
> > > search time to a very small fraction of what it currently is. One
> > > downside of indexed searching is that full regular expression searching
> > > isn't very feasible. I'll leave it as an exercise for the reader to
> > > verify that searching for /a.*b/ would be neither be very easy to
> > > implement nor very fast using an index (grin).
> > >
> > > I would really appreciate all of the feedback I can get on this since I
> > > would like the searching capabilities of Sword to as strong as is
> > > reasonably possible. If you see any problems with what I am suggesting or
> > > if you have suggestions for other improvements to searching please send
> > > them to the list.
> > >
> > > In Christ,
> > >
> > > Joel Mawhorter
>