[sword-devel] Search bug & New Arabic Bible, Not Shaped SVD Version
5001 at hotmail.com
Mon Nov 26 21:17:10 MST 2012
I'm not involved in the design of the SWORD and JSword engines, but as I understand from the above, we depend on a library that is updated very frequently in Java but gets no updates for its C port.
If I'm right, can we look for other libraries to use?
The root of the problem is that Xiphos and other frontends can't search in Arabic when the source text uses diacritics and the search term doesn't contain any.
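To make the failure concrete, here is a minimal sketch in Python of the kind of normalization that would let an unpointed search term match pointed text. It is not what SWORD, JSword, Lucene or CLucene actually do; the set of code points stripped, the function names and the toy index are all assumptions made for illustration. The point is only that the same normalization has to be applied both when the index is built and when the search term comes in.

```python
import re

# Assumption of this sketch: strip tashkeel (U+064B..U+0652),
# superscript alef (U+0670) and tatweel (U+0640). A real analyzer
# may normalize more (or less) than this.
_ARABIC_MARKS = re.compile("[\u064B-\u0652\u0670\u0640]")


def normalize(text):
    """Remove Arabic diacritics so pointed and unpointed spellings match."""
    return _ARABIC_MARKS.sub("", text)


def build_index(verses):
    """Map each normalized token to the set of verse keys that contain it."""
    index = {}
    for key, text in verses.items():
        for token in normalize(text).split():
            index.setdefault(token, set()).add(key)
    return index


def search(index, term):
    """Normalize the search term exactly as the indexed text was normalized."""
    return sorted(index.get(normalize(term), set()))


# The same word written with diacritics (as in the module text)
# and without diacritics (as a user would typically type it).
pointed = "\u0627\u0644\u0652\u0628\u064E\u062F\u0652\u0621\u0650"
unpointed = "\u0627\u0644\u0628\u062F\u0621"

index = build_index({"Gen 1:1": pointed})
print(search(index, unpointed))
```

If normalize() were applied only to the search term, or only to the index, the two spellings would no longer compare equal and the search would return nothing, which is the behaviour reported for Xiphos.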
Thanks for your interest in solving the problem
> Date: Mon, 26 Nov 2012 08:42:33 -0600
> From: greg.hellings at gmail.com
> To: sword-devel at crosswire.org
> Subject: Re: [sword-devel] Search bug & New Arabic Bible, Not Shaped SVD Version
> On Mon, Nov 26, 2012 at 8:12 AM, DM Smith <dmsmith at crosswire.org> wrote:
> > Correct. JSword uses Lucene's filter for the language, which does more normalization than the StandardAnalyzer, which SWORD uses exclusively. The StandardAnalyzer should only be used for "unaccented" latinate text. The same goes for the SimpleAnalyzer. (In Lucene, an analyzer is a filter chain which normalizes text. Rule of thumb: the same analyzer should be used for both index construction and searching.)
> > Each release of Lucene adds and/or improves the filters for non-latin text.
> > The biggest problem with using a new version of Lucene is that it invalidates, without notice, prior indexes. An analyzer may change from release to release. It has been true of the StandardAnalyzer. The impact is that the number of search hits may be reduced, perhaps to 0.
> (Un?)Fortunately for SWORD it rarely will encounter this problem, as
> CLucene is extremely rarely updated. It has seen exactly two commits
> over the past 20 months (since the tagging of the 2.3.3.4 release,
> which is current head) and neither has been an update to the
> Analyzers. This has the benefit of not invalidating search indexes
> very often but has the drawback of almost never seeing updates to the
> analyzers and any bugs they may carry.
> It seems like we could have a set of Analyzers that we build on a
> per-language basis. The CLucene contrib libraries include analyzers
> specifically for German and CJK as examples. I doubt that the upstream
> maintainers would object to including additional analyzers if we
> developed them. That is, if we can even get in contact with them and
> they're not completely dormant.
> > Both SWORD and JSword need a mechanism to record the version of Lucene that is used in constructing an index and to refuse to search an index unless the version of Lucene for searching and indexing match.
> Much noise has been made about this. But either no one has been
> willing to actually implement it, or they have been rebuffed when
> proposing how this might be stored. Nearly any changes made would still
> lead to invalidation of existing indexes, against which there has been
> much friction in the past. Storing the value in a file next to the
> indexes is a near-trivial change, but no one has done so.
> To avoid this current issue, though, would it be better to track the
> Lucene version or the Analyzer version used? From what I know of
> Lucene, some sort of hybrid of the two might be best. My understanding
> is that some versions of Lucene break compatibility with indexes made
> in previous versions, while the current issue would be addressed by
> filter changes which should be applied to both the index and incoming
> search terms.
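A minimal sketch in Python of the "file next to the indexes" idea, recording both the Lucene version and an analyzer identifier. Nothing here exists in SWORD or JSword: the file name, the JSON layout, the function names and the version strings are all hypothetical, and an unstamped index is simply treated as unusable.

```python
import json
import os
import tempfile

STAMP_NAME = "index-version.json"  # hypothetical file name


def write_stamp(index_dir, lucene_version, analyzer_id):
    """Record which library version and analyzer built this index."""
    stamp = {"lucene": lucene_version, "analyzer": analyzer_id}
    with open(os.path.join(index_dir, STAMP_NAME), "w", encoding="utf-8") as fh:
        json.dump(stamp, fh)


def index_is_usable(index_dir, lucene_version, analyzer_id):
    """Return True only if the stamp matches the searcher's configuration."""
    path = os.path.join(index_dir, STAMP_NAME)
    if not os.path.exists(path):
        return False  # no stamp: assume the index is stale
    with open(path, encoding="utf-8") as fh:
        stamp = json.load(fh)
    return stamp == {"lucene": lucene_version, "analyzer": analyzer_id}


# Usage with made-up version and analyzer names.
with tempfile.TemporaryDirectory() as index_dir:
    before = index_is_usable(index_dir, "1.0", "arabic")
    write_stamp(index_dir, "1.0", "arabic")
    same = index_is_usable(index_dir, "1.0", "arabic")
    other = index_is_usable(index_dir, "1.0", "standard")
    print(before, same, other)
```

A frontend could call index_is_usable() before searching and rebuild the index when it returns False, instead of silently returning fewer hits.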
> Again, implementing this is a near trivial task (although
> compatibility between the indexes created in C and those in Java would
> probably not be possible because the Java Lucene library is much more
> active than CLucene). It's simply never been a priority for anyone to
> > Also of note, there have been some substantial changes to Unicode from release to release. So, if the version of Unicode used by the OS, Java, ICU, ... changes, the index may no longer be valid. From what I can tell, this will mostly affect minority languages.
> > In Him,
> > DM Smith
> > On Nov 26, 2012, at 7:22 AM, Peter von Kaehne <refdoc at gmx.net> wrote:
> >>> Von: David Haslam <dfhmch at googlemail.com>
> >>> So a similar patch would be necessary in principle to JSword ???
> >> No. If And Bible does not have a problem, then JSword does its job correctly.
> >> Peter
> >> _______________________________________________
> >> sword-devel mailing list: sword-devel at crosswire.org
> >> http://www.crosswire.org/mailman/listinfo/sword-devel
> >> Instructions to unsubscribe/change your settings at above page