[sword-devel] Fast search -- some ideas

Trevor Jenkins sword-devel@crosswire.org
Thu, 31 Aug 2000 12:38:47 +0100


On Wednesday, 30 August, 2000 08:36:48, Nathan <mail@nathan.co.za> wrote:

> There seems to be quite an interest in the fast
> index searching, so I thought I would write up
> some of my ideas. Maybe you can use this as
> discussion points, disagree, improve, change it
> or ignore it. :-)

Having an interest in this I'll chime in with comments. I've been putting a
separate proposal together as and when my CFS allows. (For those who don't
know or have forgotten I worked on such document indexing systems for 20
years.)

Anyone wanting to get to grips with the underlying concepts I'd recommend
the excellent text "Managing Gigabytes" by Witten, Moffat and Bell. The
second edition was published late last year by Morgan Kaufman.

> Step 1. Building the wordlist
>
> The first thing to do is get a list of all the
> words in the file. This means reading through
> each verse/paragraph and identifying each word.

There are some oddities in the Bible text. For example, paragraphs that are
split across verses, verses that contain the end of one paragraph and the
start of the next, verses that include complete sentences, sentences that
cover several verses. Such issues become important for colocation searches
(e.g. within sentence, within verse, within paragraph, within N words of,
word adjacency).

> This is fairly easy, as one can just convert
> all punctuation characters to spaces, and then
> get a list of the remaining words.

But then you'll loose colocation.

> Keep in mind other languages have many other accented
> characters, that you do not see those as
> punctuation!

Doesn't that go without saying? :-)

> Also keep the ' in mind e.g. "isn't"
> is a valid word (else you chop it into "isn" and "t").

The scanning phase could have rules that help distinguish this. Of course,
what one wants is for "isn't" to be indexed as the two (adjacent) words "is
not". [Note the importance of colocation.]

> Also convert the words to lower case. You can
> always check for case sensitive or insensitive
> at run-time (once you have found the verse/paragraph).

There are very few instances where case needs to be considered. Some
"commercial" competitors of Sword distinguish between words like LORD and
lord making it very difficult to find passages that one does not remember
the typographic conventions used in a particular translation.

> You might want to do some "stemming" (reducing the
> number of words) by removing all words ending
> with 's. You might want to do more stemming
> e.g. removing the "-ly" at the end of words, or
> eliminating plurals (most ending with "s") A lot
> depends on your final view. Most times you will
> want to give the person searching both the singular
> and plural of a word, without them having to worry
> about specifying both. There is plenty of work one
> can do here.

There are some languages (and English is one of them) where conjugation of
verbs cannot be handled by a simple stemming scheme as you describe. When
working with other Latin languages (eg Finnish) there are real problems with
this.

> Step 2. The concordance
>
> The format of the concordance can be in any of the
> 3 methods that Joe described (use the best method
> for that specific word).
> 1. A list of pointers to the verse/paragraph: ...
> 2. A range list: ...
> 3. A "bitmap": ...

Because most words occur more than once even within a single
verse/sentence/paragraph you can compress the position pointers to indictate
that the terms appear are in the same V/S/P as one another.

> PS. Make sure the wordlist is sorted. It speeds
> things up (about 8 times in most cases).

Data structures 101 will have informed you that searching a sorted list has
O(n/2). Hash tables, with a well chosen hash function, appraoch O(1).


> Step 3. The search

> Expressions: Here some parsing is needed. This is when
> the person asks
> e.g. "((Jesus and Son) not Father) or Christ*"
> The expression will have to be parsed,

Yes but not

> e.g. into BNF,

One could express the search language in BNF but what you probably meant was
RPN (reverse polish notation)

> The NOT part is the difficult one here :)

Do you consider NOT to be monadic or diadic?

> Proximity: This is when you want to have some words, but
> they can span +/- 5 verses. You know the words are
> close together, but not in the same verse/paragraph.
> This is obviously more difficult, and that is why few
> programs have it. I have not looked at this as yet either.
> Just mentioning it here for completeness.

As might be guessed from my earlier comments this is an area that I have
given a lot of thought to what is involved. :-)

> Other languages: Something To think about as well is to make
> it easier for people searching other languages. This would
> imply that people could type in "seen" and get "seën".
> (Afrikaans word for "bless"). Sort of accent-insensitive.
> Helps a lot with languages that have many accents.

This is similar to the stemming problems I mentioned above. (The commercial
system I worked on had a special "morphological analysis" module to deal
with this issue.)

> Natural language queries: e.g. "Where in the Bible can I
> read the 10 Commandments?" Skip this for now... <grin>

I see this as a separate layer between the user interface and the search
expressions.

> Spelling mistakes: If a word is not found, I was thinking
> on using a "soundex" type routine to get the nearest words
> to it. (soundex will associate "kayotik" with "chaotic")
> I don't know if soundex is the correct one. There are better
> routines. (Might even use something like PPM they use with
> natural language queries.) Then suggest an alternative to
> the person.

Probably the best one is NYSIIS rather than Soundex. But where are the
spelling mistakes coming from? Hopefully never the text but rather finger
trouble on the part of the user. For the former I say clean up the text for
the latter I say do nothing :-| let them note their mistake and take
corrective action themselves (retyping the correct terms or using
wildcards).

> You need to look at this, especially with the way people
> spell "color" vs. "colour", etc. The KJV uses "colour".
> (I could throw in some flame-bait here about
> "proper" English <grin>)

Thou shalt not. :-)

> Step 4. Ranking

Hate it. Don't like it. Never use it. Precision/recall studies haven't
demonstrated that ranking really works (for the end-user).

> The ranking is then done on a "majority rules" method.
> The verse with the most occurrences get the top spot.
> A better method would be to look at
>   1. The book/chapter with the most hits
>   2. Then the verse with the most hits.

Moffat/Witten/Bell discuss various weighing schemes. As you can see I'm no
fan of ranking so I leave it as an exercise to the reader (of MWB) to decide
what scheme might be appropriate.

> Step 5. Optionally compressing the text
>
> Because all searching is basically done in the index,
> the text does not have to be in uncompressed format
> any more, and can be compressed. When a verse needs to
> be displayed on the screen, it is uncompressed on the
> fly and then displayed. This way you can get the entire
> KJV plus index in just over 2 Mb.

With full colocation information (in a compressed index file) one can do
away with the text completely. Okay so there is more work to be done when
displaying the text but it's an option.

> I thought these verses were sort of applicable... :)

You forgot one:

"These were more noble than those in Thessalonica, in that they received the
word with all readiness of mind, and searched the scriptures daily, whether
those things were so." Acts 17:11 AV.

:-)

Regards, Trevor

British Sign Language is not inarticulate handwaving; it's a living
language. So recognise it now.

--

<>< Re: deemed!