[sword-devel] Announce: Sword/PDA for the Agenda PDA

David J. Orme sword-devel@crosswire.org
Mon, 22 Oct 2001 21:42:14 -0500

David Burry wrote:

>At 10:13 AM 10/22/2001 -0700, Chris Little wrote:
>>Would that be possible for a RE that involved crossing a word boundary?
>>Something like /\<Jesus \w+d\>/, for example.  I suppose you could split
>>up the RE itself by word boundaries, collecting a list of words that
>>match /\<Jesus\>/ and words that match /\<\w+d\>/, then finding all
>>instances where they come in order, separated by spaces.  But then you
>>have to account for \s+ and .+, at which point I would give up and just
>>reconstitute the whole verse string. :)
>For inverted indexes, I believe you need to restrict your regular expressions to matching individual words only (more precisely, whatever terms are indexed) if you want to take advantage of the performance increase.... but this isn't necessarily bad if you add more search operators such as phrase, within n words, followed by within n words (phrase = followed by within 1 word but it can be more optimized if done separately), etc...  In fact you could end up with a much richer operator set and still be lightning fast.
This would be the easiest way to do it.  You'll probably have all the other operators anyway, because Joe User doesn't know REs and doesn't want to learn.
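
To make the operator idea concrete, here is a rough sketch of "followed by within n words" over a positional inverted index.  It is in Python purely for illustration and is not the Sword/PDA code; the function names, the index layout, and the toy verse strings are all assumptions made up for this example.

```python
from collections import defaultdict

def build_index(verses):
    """Map each lowercased word to a list of (verse_id, position) postings."""
    index = defaultdict(list)
    for verse_id, text in verses.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((verse_id, pos))
    return index

def followed_within(index, first, second, n):
    """Return verse ids where `second` occurs 1..n words after `first`."""
    # Group the positions of the second word by verse for quick lookup.
    second_positions = defaultdict(set)
    for verse_id, pos in index.get(second.lower(), []):
        second_positions[verse_id].add(pos)
    hits = set()
    for verse_id, pos in index.get(first.lower(), []):
        if any(pos + k in second_positions[verse_id] for k in range(1, n + 1)):
            hits.add(verse_id)
    return sorted(hits)

# Toy data, invented for the example (no punctuation handling here).
VERSES = {
    "v1": "jesus wept",
    "v2": "jesus answered and said",
    "v3": "then said jesus",
}
INDEX = build_index(VERSES)

print(followed_within(INDEX, "jesus", "wept", 1))  # phrase search: ['v1']
print(followed_within(INDEX, "jesus", "said", 3))  # within 3 words: ['v2']
```

A phrase is just the n = 1 case, which is why it can be special-cased and optimized separately.  The verse text itself is never scanned; only the postings for the two words are touched.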

However, I think it would be possible to analyze the RE (probably using REs), and break it down into multiple REs that each match a word, which transforms the problem into the one you described.  Whether this is worth the hassle is another question, though, as Chris pointed out.
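
A minimal sketch of that decomposition, assuming the simplest possible case: the RE's words are separated by single literal spaces, so it can be split on spaces into one RE per word.  This is Python for illustration only, not anything from Sword/PDA; the names and toy verses are invented, and it deliberately does not handle \s+ or .+, which is exactly where Chris said it gets painful.

```python
import re
from collections import defaultdict

def build_index(verses):
    """Map each word to a list of (verse_id, position) postings."""
    index = defaultdict(list)
    for verse_id, text in verses.items():
        for pos, word in enumerate(text.split()):
            index[word].append((verse_id, pos))
    return index

def search_word_regexes(index, pattern):
    """Split `pattern` on literal spaces and match each piece against indexed words.

    A verse is a hit when words matching the pieces appear consecutively.
    """
    pieces = [re.compile(p) for p in pattern.split(" ")]
    candidates = None
    for offset, piece in enumerate(pieces):
        # Record where a match would have to *start* for this piece to line up.
        starts = set()
        for word, postings in index.items():
            if piece.fullmatch(word):
                starts.update((v, pos - offset) for v, pos in postings)
        candidates = starts if candidates is None else candidates & starts
    return sorted({verse_id for verse_id, _ in candidates})

# Toy data, invented for the example.
VERSES = {
    "v1": "Jesus wept",
    "v2": "Jesus answered and said",
    "v3": "then said Jesus",
}
INDEX = build_index(VERSES)

# Chris's example /\<Jesus \w+d\>/, with the word boundaries implied by the index.
print(search_word_regexes(INDEX, r"Jesus \w+d"))  # ['v2']
```

Each piece is run against the dictionary of distinct words rather than against the text, so the cost depends on the vocabulary size and the postings of the matching words.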

>I'm not sure about punctuation, I need to review his documentation first.  I've done work on this kind of inverted index before and I'm itchin to see how he did that part!  ;o)  I've usually thought it desirable for text search engines to ignore punctuation anyway and just match words.
In my code, punctuation is treated like a word; each punctuation mark gets its own entry in the dictionary file, ....  See the docs / code for details.  Actually, the code that tokenizes the Bible into "words" for the dictionary is generated using flex, so you might just want to dig into that.
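
For a rough idea of what "punctuation is treated like a word" means, here is a stand-in tokenizer in Python.  It is not the flex-generated tokenizer from the actual code, and its rules are an assumption for illustration: a run of word characters is one token, and every other non-space character is a token of its own.

```python
import re

# Assumed rule set: a run of word characters, or one non-word, non-space character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_RE.findall(text)

def build_dictionary(tokens):
    """Assign each distinct token (words and punctuation alike) an integer id."""
    dictionary = {}
    for token in tokens:
        dictionary.setdefault(token, len(dictionary))
    return dictionary

tokens = tokenize("In the beginning, God")
print(tokens)                    # ['In', 'the', 'beginning', ',', 'God']
print(build_dictionary(tokens))  # the comma gets its own dictionary entry
```

With these assumed rules an apostrophe also splits a word, so "God's" becomes three tokens; a real tokenizer would need to decide how to treat cases like that.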

>See http://beaver.dburry.com/cgi-perl/bible to see what I've done on inverted indexes before, specifically geared toward speed at all costs otherwise.  It's not sword-based but I've been interested in integrating some of my stuff with sword for a long time (at least an import tool to convert sword modules to my index format), maybe this thing by Dave Orme will get me motivated!  Or maybe I'll scrap my work and just work on his...  I just want to see the best tool possible who cares about ego...  ;o)
Let me know what you think after you've read the docs/code. Maybe you 
can do it better than I can.  ;-)  I'll check out your web site too.



The number of UNIX installations has grown to 10, with more expected.
   -- The Unix Programmer's Manual, 2nd Edition, June 1972