[sword-devel] search idea
Troy A. Griffitts
Sun, 26 Dec 1999 01:04:30 -0700
Great ideas for features. The place to start looking would be in class
SWModule. This is the base class for all SWORD modules and contains the
very basic search logic. To include extra index searching into the API,
you should probably, initially, try subclassing class RawText
(specifically for Bible texts), and override the SWModule::Search
method. On modules that you have indexed for your special searches, you
will need to specify your special module driver in the module's .conf
and look in SWMgr to add recognition of your new driver.
Then, if you think your logic may be generalized for module types other
than Bibles, we will abstract the search logic into it's own object and
have SWModule own an instance of a search object newed to a type
specified in the .conf file.
Let me know if you need help with understanding the API.
Trevor Jenkins wrote:
> On Thursday, 23 December, 1999 18:17:40, Daniel Glassey
> <email@example.com> wrote:
> > I don't know if any other programs can do this yet but it would be
> > great if you could do more complicated searches. Like search for
> > words in the same 'sentence' or 'paragraph' or 'section' as well as
> > verses and chapters. Boolean searches (I'm not certain but I don't
> > think it's in there yet) would be great as well.
> Because of my previous job (20+ years as a text retrieval specialist) I
> expect these features in any application that searches text. I'm frustrated
> by the Bible search programs I use currently (Online Bible for Mac and
> Compton's Interactive Bible*) because they don't offer sufficiently flexible
> search operators. And grep well it's okay for ad hoc searches but it isn't
> good enough for real work.
> What I expect to use are word proximity (this within n words of that),
> within sentence, within verse, within paragraph, within chapter, within
> book, within text. I'm assuming a biblical structure here but the "schema"
> could be extended to cover any textual structure. These should all be
> combinable with boolean operators of AND, OR, NOT, XOR. It is possible to
> see "within paragraph" as a restriction on AND.
> I can see a (minor) complication with the "within sentence" and "within
> verse" operators as there are a few occurences in the scriptures of a
> sentence that extends over several verses and a few occurences of a verse
> that extends over several sentences. The UBS NA27 text has a specific
> indicator of | to disambiguate this situation. Not insurmountable but could
> be troublesome.
> There would be design decisions such as what is the extent of the booleans?
> Should it be verse? (This is how OLB for Mac does it.) Then should the
> proximity measures extend across verse boundaries, sentences, chapters?
> (I've come across situations where these are arbitary limits.)
> How would search of non-latin texts be done? Greek and Hebrew. If nothing
> else how do you type a greek or hebrew glyph into the search request?
> Should there be a parallel search operator? That is can one search several
> texts in "parallel" and what should the result be? Are all matching verse
> found? Ought it to give the union of all matching verses from all (open)
> translations. Should commentaries be searched at the same time?
> > Does anyone know how these might be done?
> A grep-like scheme is not efficient for this task. So it is fundamental when
> a text is made available to SWORD for searching that that text must be
> scanned and all occurences of words identified. An index file is then
> written that lists each such word and all the position information (text,
> book, chapter, verse, sentence, word). Can you say "inverted file"? I knew
> you could. All lookups are then made in the inverted file rather than the
> text itself.
> So that searching an index is fast the words are hashed. Yes, I know that
> most of the text retrieval systems use B-trees for index files but they are
> wrong. Hashing is faster; remember Data Structures 101. Also there is no
> reason for the index files to extend as a general text retrieval needs. Even
> then it is possible to have extensible hashes. Per Ake Larson and others did
> lots of research on this topic in the mid-80s. The text retrieval system
> that I worked on Paralog's TRIP has extensible hashed index files. The trick
> is to ignore Maurer's stupid suggestion of using a prime number for the size
> of the hash table; instead you make it a power of 2.
> If the index file is hash then there is a knock on effect to pattern
> matching expressions in the search requests. Even that can be done using
> hashed index files. I know it can and I know how! That's what we did in
> TRIP. Can you say invert the invert file? I knew you could. :-)
> > This is a library level thing but would be great for bibletime as well.
> Of course, if it becomes a library level thing then none of the bets are
> off. It will work. Again we did it in TRIP. What an API level has is a
> general structure. The application (in our case SWORD) instantiates a
> specific schema (or schemas) for the translations.
> The real secret is how do you reduce the size of the inverted file(s). When
> I started in text retrieval the expection was that an inverted file involved
> a minimum of 150% overhead. It is now possible to have that down to 40%. Can
> you say compression? I knew you could. :-) And if you compress the words in
> the inverted file then it can be compressed even further---though personally
> I don't subscribe to that idea.
> There are optimisation to be had in the boolean and proximity operator
> implementation to.
> Now all this is a first pass and nothing that my "post viral fatigue" has
> interferred with. I'll need to be more clear headed to come up with
> specification and implementation details but for the moment the above notes
> will demonstrate that I've a solid grasp on the issues involved. Whilst
> you're all digesting them I'll be getting better. :-)
> Regards, Trevor
> British Sign Language is not inarticulate handwaving; it's a living
> language. So recognise it now.
> <>< Re: deemed!