[sword-devel] Announce: Sword/PDA for the Agenda PDA

David Burry sword-devel@crosswire.org
Tue, 23 Oct 2001 10:10:08 -0700


At 09:42 PM 10/22/2001 -0500, David J. Orme wrote:
>David Burry wrote:
>>I'm not sure about punctuation, I need to review his documentation 
>>first.  I've done work on this kind of inverted index before and I'm 
>>itchin to see how he did that part!  ;o)  I've usually thought it 
>>desirable for text search engines to ignore punctuation anyway and just 
>>match words.
>In my code, punctuation is treated like a word; each punctuation mark gets 
>its own entry in the dictionary file, ....  See the docs / code for 
>details.  Actually, the code that tokenizes the Bible into "words" for the 
>dictionary is generated using flex, so you might just want to dig into that.

Ohh...  I see...  I was trying to make it possible to do a "within so many 
words" operator with the same index; that's why I didn't give punctuation 
its own entry.  Of course, I wasn't programming for a tiny, memory-constrained 
PDA.  I was doing it for a client-server model (the web), where 
it's easy to just drop another couple of gigs into a web server.... ;o)  In 
fact I didn't even compress the inverted index at all; I used Berkeley DB 
(B+tree) instead, which is quite wasteful of space (it takes about 25 
megs per version), but it sure is ***blazing***fast*** as a result!!!  My 
reasoning for doing this was that for a very high traffic web site I'd 
rather use a little extra hard drive space than make people wait around 
even an extra half second.  My philosophy is that people don't get faster 
computers and high-bandwidth connections so that their software 
and web sites can get heavier and still run at the same old speed; they want 
enhanced performance instead.  I was just trying to give people that wish 
without any expense on the client side necessarily... ;o)  If anyone knows 
of more space conservative ways of achieving these goals, I'm all ears...
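
To make the "within so many words" idea concrete, here is a rough sketch in 
Python.  It is only an illustration under my own assumptions, not the actual 
code (which sat on Berkeley DB): the names tokenize, build_index, and within 
are made up for the example, and the index is a plain in-memory dict.  The 
point is that each word stores its word positions per verse, and punctuation 
is skipped so it never uses up a position.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Keep only runs of letters, digits, and apostrophes. Punctuation is
    # dropped, so it never occupies a word position in the index.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(verses):
    # Hypothetical layout: word -> {verse_id -> [word positions]}.
    index = defaultdict(lambda: defaultdict(list))
    for verse_id, text in verses.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][verse_id].append(pos)
    return index


def within(index, a, b, n):
    # Return the sorted verse ids where some occurrence of word `a` lies
    # at most `n` word positions away from some occurrence of word `b`.
    postings_a = index.get(a.lower(), {})
    postings_b = index.get(b.lower(), {})
    hits = []
    for verse_id in postings_a.keys() & postings_b.keys():
        if any(
            abs(pa - pb) <= n
            for pa in postings_a[verse_id]
            for pb in postings_b[verse_id]
        ):
            hits.append(verse_id)
    return sorted(hits)
```

If punctuation had its own entries and positions, the distance between two 
words would change depending on how many commas sit between them, which is 
why I left it out of the position count.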

Dave