[jsword-devel] Re: Search and its bugs

DM Smith dmsmith555 at gmail.com
Sat Apr 9 07:07:51 MST 2005


Well, I have played a little with doing a simple ranked search using
Lucene and a best match using JSword's current algorithm. For the
search "bread of life", 14 of the top 20 verses agreed, and those 14
were Lucene's top 14.
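
For reference, the ranked side of that comparison takes only a few lines of
plain Lucene. This is a rough sketch against the Lucene 1.4-era API (the
index path and field names are made up), not the exact code I ran:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;

    public class RankDemo {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            Hits hits = searcher.search(
                QueryParser.parse("bread of life", "content", new StandardAnalyzer()));
            // Hits come back already ordered by Lucene's relevance score.
            for (int i = 0; i < Math.min(20, hits.length()); i++) {
                System.out.println(hits.score(i) + "  " + hits.doc(i).get("key"));
            }
            searcher.close();
        }
    }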

It looks like the difference was that Lucene used "of" and JSword did
not. Given this, I think I am going to change the Match check box to
Rank and add it to AdvancedSearch with a slider indicating how many
verses to rank, or whether to rank all hits.

Then I am going to add SearchSyntax.

On Apr 8, 2005 11:42 PM, DM Smith <dmsmith555 at gmail.com> wrote:
>  From looking at Lucene, it appears that the only thing that it does not do
> that JSword currently does is "blur". To do range searching in Lucene, we
> would need to index the ordinal verse value along with the verse. Then > and
> < can be used to do the range search. The problem with storing the ordinal
> value is that it will make alternate versification harder. Perhaps a better
> way would be to encode the reference as something like:
>  verse + book * 10 + chapter * 1000 + testament * 1000000 (not exactly these
> powers of 10, but the smallest ones that would make it work.)
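>  
>  A rough sketch of what that packing could look like (the multipliers here
> are placeholders, not the final scheme, and the class and method names are
> only illustrative):
>  
>     public class VerseOrdinal {
>         // Packs a reference into one integer so that Lucene's < and >
>         // operators order verses correctly.  The multipliers only need to
>         // be large enough that the fields cannot collide.
>         public static int encode(int testament, int book, int chapter, int verse) {
>             return verse + chapter * 1000 + book * 1000000 + testament * 100000000;
>         }
>  
>         // Lucene range terms compare as text, so the value would also need
>         // zero-padding to a fixed width before it is indexed.
>         public static String toTerm(int ordinal) {
>             return new java.text.DecimalFormat("000000000").format(ordinal);
>         }
>     }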
>  
>  Back on Blur: Lucene uses ~ as a word suffix operator. So if we required ~
> and ~n (where n is the blur factor) to be surrounded by whitespace, we could
> use Lucene to do the "halves" and combine the operations using a logical AND.
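>  
>  Roughly what I have in mind is sketched below; Passage, searchLucene() and
> blur() are stand-ins for whatever the real calls end up being, so treat the
> names as hypothetical:
>  
>     // Split "moses ~5 aaron" at the blur operator, run each half through
>     // Lucene, widen both results by the blur factor, then AND them.
>     Passage searchWithBlur(String query) throws Exception {
>         java.util.regex.Matcher m =
>             java.util.regex.Pattern.compile("(.+)\\s~(\\d+)\\s(.+)").matcher(query);
>         if (!m.matches()) {
>             return searchLucene(query);          // no blur operator present
>         }
>         int blurBy = Integer.parseInt(m.group(2));
>         Passage left = searchLucene(m.group(1)).blur(blurBy);
>         Passage right = searchLucene(m.group(3)).blur(blurBy);
>         left.retainAll(right);                   // logical AND of the two halves
>         return left;
>     }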
>  
>  Best match is also affected, but it may not be significant. As it currently
> works, it does a fuzzy match on the non-stop words in a phrase and weights
> them differently than the straight search. I think that Lucene already does
> weighting that takes the fuzziness into account. It would be good to do some
> comparison against a straight Lucene fuzzy match to see if the results are
> significantly different. If they are not, we would only need to account for
> blur and ranges, and could create a simple parser that splits searches with
> blur and ranges into parts and submits each part.
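>  
>  For that comparison, a straight Lucene fuzzy match is just the ~ suffix on
> each non-stop word. Something like the following, given an IndexSearcher over
> the verse index; the "content" field name is only an assumption about how we
> would index the verses:
>  
>     Query fuzzy = QueryParser.parse("bread~ life~", "content", new StandardAnalyzer());
>     Hits hits = searcher.search(fuzzy);   // scores already reflect the fuzziness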
>  
>  If this becomes the case, then javacc would be overkill.
>  
>  Advanced search would become more important, as it would help build complex
> searches. And SearchSyntax would become more helpful.
>  
>  I think I will start by creating SearchSyntax and applying it to the
> existing code. Once that is done, we can then play with other search engines
> to see what is better.
> 
>  
>  Joe Walker wrote: 
> 
>  Having a SearchSyntax sounds like a good idea to me.
>  
>  It would be good if we could implement it using Lucene, we've talked about
> using their query parser in the past.
>  
>  The problems of the search query parser probably come down to the way it
> has evolved, which seems to be a common pitfall for any parser code - the
> pattern seems to be that the parser evolves to the point where squashing
> bugs becomes too regular, and then someone sits down and writes a grammar
> for it. I noticed that Groovy has just been through this.
>  I've dabbled with javacc successfully on a couple of projects, and once
> tried to write a COBOL grammar - very unsuccessfully - so I know it can be
> hard. This may well be overkill for our simple syntax?
>  
>  Other than that, go for it!
>  
>  Joe.
>  
>  
>  
> On Apr 8, 2005 12:52 PM, DM Smith <dmsmith555 at gmail.com> wrote: 
> > I've narrowed down some of the bugs in search. It seems that the tokenizer
> > is not producing the correct stream of tokens.
> > Specifically, the algorithm that uses the tokens goes something like this:
> > 
> > while there are command tokens at the beginning of the stream
> > do
> >     get the next command token
> >     have that command consume word tokens until it reaches a terminating
> >     condition
> > done
> > 
> > The problem with +[mat-rev]"bread of life" is that it produces a token
> > stream where +[mat-rev] is not followed by a command token.
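> > 
> > In code the loop is roughly the following (the class and method names
> > below are illustrative, not the actual JSword ones):
> > 
> >     while (!tokens.isEmpty() && tokens.peek() instanceof CommandToken) {
> >         CommandToken cmd = (CommandToken) tokens.remove();
> >         // each command pulls word tokens until its terminating condition
> >         cmd.consumeWords(tokens);
> >     }
> >     // For +[mat-rev]"bread of life" the tokens after +[mat-rev] begin with
> >     // a word token, not a command token, so the loop stops too early.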
> > 
> > In looking at this I noticed what looks like a design problem.
> > Consistently, elsewhere in JSword, an interface defines a wall that
> > BibleDesktop and JSword do not look behind. However, that is not so in
> > the case of searching.
> > 
> > jsword.book.search
> >     provides the interfaces for Search and Index, and factories to get
> >     an implementation
> > jsword.book.search.basic
> >     provides abstract/partial implementations of the interfaces
> > jsword.book.search.parse
> >     provides an implementation of Searcher
> > jsword.book.search.lucene
> >     provides an implementation of Indexer
> > 
> > Based upon this I would have expected that no code (outside of the
> > package) would have directly used jsword.book.search.parse code.
> > 
> > The reason I noticed this was that I wanted to create another searcher
> > and get it from the search factory. (Start with a copy and fix bugs,
> > while retaining the ability to use BibleDesktop by changing the
> > factory's properties.)
> > 
> > What is being used are the syntax elements, to programmatically construct
> > a search. I'm thinking that we need YAI (yet another interface) for
> > SearchSyntax (a rough sketch follows the list below). It would be able to:
> > 1) decorate individual words and phrases with appropriate syntax elements.
> >     SearchSyntax ss = SearchSyntaxFactory.getSearchSyntax();
> >     String decorated = ss.decorate(SyntaxType.STARTS_WITH, "bread of life");
> >     decorated = ss.decorate(SyntaxType.FIND_ALL_WORDS, "son of man");
> >     decorated = ss.decorate(SyntaxType.FIND_STRONG_NUMBERS, "1234 5678");
> >     decorated = ss.decorate(SyntaxType.BEST_MATCH, "....");
> >     decorated = ss.decorate(SyntaxType.PHRASE_SEARCH, "....");
> >     ...
> > 
> > 2) create a token stream from a string.
> >     Token[] tokens = ss.tokenize("search string");
> >     or
> >     TokenStream tokens = ss.tokenize("search string");
> >     or
> >     ...
> > 
> > 3) serialize a token stream to a string.
> > 
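> > As a rough sketch (all names tentative), the interface might be no more
> > than:
> > 
> >     public interface SearchSyntax {
> >         // 1) wrap words/phrases in the syntax elements for the given type
> >         String decorate(SyntaxType type, String words);
> >         // 2) create a token stream from a search string
> >         Token[] tokenize(String search);
> >         // 3) serialize a token stream back to a search string
> >         String serialize(Token[] tokens);
> >     }
> > 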
> > Input desired!
> > 
>  
>

