[sword-devel] Re: [bt-devel] fast search

Trevor Jenkins sword-devel@crosswire.org
Fri, 9 Feb 2001 13:59:09 +0000 (GMT)

On Thu, 8 Feb 2001, David Burry <dburry@tagnet.org> wrote:

> order-preserving hash indexing algorithm 

I don't understand what an "order-preserving hash" means. 

> allows complex searches such
> as one operand within so many words/verses/chapters/etc of another
> operand (this is not all yet functional but it's coming trust me).  

That's just the type of feature I think should be available as part of a
"fast search". Plus the boolean operators. I'm not convinced of the need
for relevance ranking (either in sword or in general).

> The downside of what I've written there is that the indexes are about
> 5 times the size of the original data.

Sounds as if your keeping the position pointers uncompressed. The
commerical system I helped to develop got the total size of the inverted
files (we had two for each base file) down to about .7 of the size of the
base file. We did this partly by removing reduncancy in position pointers
and by not using fixed width integers.

If we were to index this paragraph of text then the position pointers in
the index for "index" would be:

<parax, sent1, 5>
<parax, sent1, 16>
<parax, sent1, 18>

we squeezed it to be <parax, sent1, 5>, <15>, <18>. For the word "the" we
would store:

<parax, sent1, 11>, <15>, <sent2, 2>, <4>

rather than:

<parax, sent1, 11>
<parax, sent1, 15>
<parax, sent2, 2>
<parax, sent2, 4>

If you store the pointers in 32-bits ints then "index" requires 9 ints
uncompressed ours requires only 5. For "the" we go from 12 down to 7.
However, we had a "binary point" representation for integers that squeezed
them into even less bits (basically only as many bits as is necessary to
represent them). When you do this on megabytes of text the compression is
dramatic, on gigabytes sensational. :-)

> yep yep, bought and devoured that book "Managing Gigabytes"  ;o)

You know as much as me then. :-) Except I read MG to find out about latest
trends. They advocate compressing both inverted and base files.

> ... I thought it was interesting that they danced around it and almost
> touched it a few times, but never really nailed my particular
> algorithm,

Do you have the first or second edition?

> ... maybe I should write a masters thesis... lol

Maybe. :-) 

Regards, Trevor

British Sign Language is not inarticulate handwaving; it's a living language.
Support the campaign for formal recognition by the British government now!


<>< Re: deemed!