[sword-devel] Announce: Sword/PDA for the Agenda PDA
Mon, 22 Oct 2001 12:10:17 -0700
At 10:13 AM 10/22/2001 -0700, Chris Little wrote:
>Would that be possible for a RE that involved crossing a word boundary?
>Something like /\<Jesus \w+d\>/, for example. I suppose you could split
>up the RE itself by word boundaries, collecting a list of words that
>match /\<Jesus\>/ and words that match /\<\w+d\>/, then finding all
>instances where they come in order, separated by spaces. But then you
>have to account for \s+ and .+, at which point I would give up and just
>reconstitute the whole verse string. :)
For inverted indexes, I believe you need to restrict your regular expressions to matching individual words (more precisely, whatever terms are indexed) if you want to keep the performance gain. That isn't necessarily bad, though, if you add more search operators such as phrase, within n words, and followed by within n words (a phrase is just "followed by within 1 word", though it can be optimized further if handled separately). In fact you could end up with a much richer operator set and still be lightning fast.
I'm not sure about punctuation; I need to review his documentation first. I've done work on this kind of inverted index before and I'm itchin' to see how he did that part! ;o) I've usually thought it desirable for text search engines to ignore punctuation anyway and just match words.
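To make the idea concrete, here is a rough sketch of how a per-word regex plus a "followed by within n words" operator could sit on top of a positional inverted index. This is only an illustration under my own assumptions, not how Dave Orme's index or sword actually works: the function names, the verse ids (v1..v4) and the sample texts are all made up, the tokenizer simply drops punctuation, and matching is case-sensitive.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Assumption: punctuation is ignored; only runs of letters, digits
    # and apostrophes count as indexed terms.
    return re.findall(r"[A-Za-z0-9']+", text)

def build_index(verses):
    # Positional inverted index: term -> {verse_id: [word positions]}.
    index = defaultdict(lambda: defaultdict(list))
    for verse_id, text in verses.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][verse_id].append(pos)
    return index

def match_terms(index, pattern):
    # The regex runs against the indexed vocabulary only, never the verse text.
    rx = re.compile(pattern)
    return [term for term in index if rx.fullmatch(term)]

def postings(index, pattern):
    # Merge the positions of every term that matches the per-word regex.
    merged = defaultdict(set)
    for term in match_terms(index, pattern):
        for verse_id, positions in index[term].items():
            merged[verse_id].update(positions)
    return merged

def followed_within(index, first, second, n=1):
    # Verses where a word matching `first` is followed, within n words,
    # by a word matching `second`.
    a = postings(index, first)
    b = postings(index, second)
    hits = []
    for verse_id in sorted(a.keys() & b.keys()):
        if any(0 < q - p <= n for p in a[verse_id] for q in b[verse_id]):
            hits.append(verse_id)
    return hits

def phrase(index, first, second):
    # A two-word phrase is "followed by within 1 word".
    return followed_within(index, first, second, n=1)

# Made-up sample texts with hypothetical verse ids.
verses = {
    "v1": "Jesus answered and said unto them",
    "v2": "Then Jesus went up and prayed",
    "v3": "Jesus, moved with compassion, touched him",
    "v4": "He said unto Jesus",
}
index = build_index(verses)

print(phrase(index, r"Jesus", r"\w+d"))                # ['v1', 'v3']
print(followed_within(index, r"Jesus", r"\w+d", n=3))  # ['v1', 'v2', 'v3']
```

The phrase query here plays the role of Chris's /\<Jesus \w+d\>/ example: each half of the pattern is matched against the term list, and only the position lists are compared, so the verse string never has to be reconstituted. Note that v3 matches only because the comma after "Jesus" is ignored.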
See http://beaver.dburry.com/cgi-perl/bible for what I've done with inverted indexes before; it is geared toward speed above all else. It's not sword-based, but I've been interested in integrating some of my stuff with sword for a long time (at least an import tool to convert sword modules to my index format). Maybe this thing by Dave Orme will get me motivated! Or maybe I'll scrap my work and just work on his... I just want to see the best tool possible; who cares about ego... ;o)