[sword-devel] Search bug & New Arabic Bible, Not Shaped SVD Version

Mon Nov 26 07:42:33 MST 2012

On Mon, Nov 26, 2012 at 8:12 AM, DM Smith <dmsmith at crosswire.org> wrote:
> Correct. JSword uses Lucene's filter for the language, which does more normalization than the StandardAnalyzer which SWORD uses exclusively. The StandardAnalyzer should only be used for "unaccented" latinate text. Same with the SimpleAnalyzer. (In Lucene, an analyzer is a filter chain which normalizes text. Rule-of-Thumb: the same should be used for both index construction and searching.)
>
> Each release of Lucene adds and/or improves the filters for non-latin text.
>
> The biggest problem with using a new version of Lucene is that it invalidates, without notice, prior indexes. An analyzer may change from release to release. It has been true of the StandardAnalyzer. The impact is that the number of search hits may be reduced, perhaps to 0.
>

(Un?)Fortunately, SWORD will rarely encounter this problem, because
CLucene is very rarely updated. It has seen exactly two commits
over the past 20 months (since the tagging of the 2.3.3.4 release,
which is current head) and neither has been an update to the
Analyzers. This has the benefit of not invalidating search indexes
very often but has the drawback of almost never seeing updates to the
analyzers and any bugs they may carry.

It seems like we could have a set of Analyzers that we build on a
per-language basis. The CLucene contrib libraries include analyzers
specifically for German and CJK as examples. I doubt that the upstream
maintainers would object to including additional analyzers if we
developed them. That is, if we can even get in contact with them and
they're not completely dormant.
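
As a rough sketch of the per-language idea (plain C++, not CLucene API):
a table maps a language code to a normalizer, and the same normalizer
would be run over both the text being indexed and the incoming search
terms. The names Normalizer, stripArabicMarks and normalizerFor are
made up for illustration, and the Arabic rule shown (dropping the
marks U+064B..U+0652 and the tatweel U+0640) is only an assumed
example of what such a filter might do.

```cpp
#include <map>
#include <string>

// Hypothetical normalizer signature: code points in, code points out.
typedef std::u32string (*Normalizer)(const std::u32string&);

// Fallback for languages without a dedicated normalizer: no change.
std::u32string identity(const std::u32string& text) { return text; }

// Assumed example rule: drop Arabic vowel marks (U+064B..U+0652)
// and the tatweel elongation character (U+0640).
std::u32string stripArabicMarks(const std::u32string& text) {
    std::u32string out;
    for (char32_t c : text) {
        const bool mark = c >= 0x064B && c <= 0x0652;
        if (!mark && c != 0x0640) out.push_back(c);
    }
    return out;
}

// Look up the normalizer for a language code such as "ar".
Normalizer normalizerFor(const std::string& lang) {
    static const std::map<std::string, Normalizer> table = {
        {"ar", &stripArabicMarks},
    };
    const auto it = table.find(lang);
    return it == table.end() ? &identity : it->second;
}
```

Calling normalizerFor("ar") on both sides keeps the index and the
query in agreement; any unlisted language falls through to identity.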

> Both SWORD and JSword need a mechanism to record the version of Lucene that is used in constructing an index and to refuse to search an index unless the version of Lucene for searching and indexing match.
>

Much noise has been made about this, but either no one has been willing
to actually implement it, or those who proposed how the value might be
stored were rebuffed. Nearly any change would still invalidate existing
indexes, against which there has been much friction in the past.
Storing the value in a file next to the indexes is a near-trivial
change, but no one has done so.
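
To show how small that change could be, here is a minimal sketch in
plain C++. The file name "index.version" and the three function names
are assumptions for illustration; SWORD defines none of them today.
The version string would be whatever the indexing library reports.

```cpp
#include <fstream>
#include <string>

// Hypothetical stamp file name; not an existing SWORD convention.
const char* const kStampName = "index.version";

// Write the indexing library's version string next to the index.
bool writeIndexStamp(const std::string& indexDir, const std::string& version) {
    std::ofstream out((indexDir + "/" + kStampName).c_str(), std::ios::trunc);
    if (!out) return false;
    out << version << '\n';
    out.flush();
    return static_cast<bool>(out);
}

// Read the stamp back. Returns false when no stamp exists,
// which is the case for every index built before such a change.
bool readIndexStamp(const std::string& indexDir, std::string& version) {
    std::ifstream in((indexDir + "/" + kStampName).c_str());
    return in && std::getline(in, version);
}

// True only when a stamp exists and equals the running library's version.
bool indexMatchesLibrary(const std::string& indexDir, const std::string& running) {
    std::string stamped;
    return readIndexStamp(indexDir, stamped) && stamped == running;
}
```

An index with no stamp is treated as a mismatch here, which is exactly
the invalidation of existing indexes mentioned above; a frontend could
instead choose to warn and search anyway.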

To avoid this current issue, though, would it be better to track the
Lucene version or the Analyzer version used? From what I know of
Lucene, some sort of hybrid of the two might be best. My understanding
is that some versions of Lucene break compatibility with indexes made
in previous versions, while the current issue would be addressed by
filter changes which should be applied to both the index and incoming
search terms.
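
A hybrid check might look roughly like the sketch below. Everything in
it is hypothetical: the IndexStamp fields, the "standard/1" style of
analyzer identifier, and the use of strict equality on the library
version (which is more conservative than necessary if a given release
can still read older indexes).

```cpp
#include <string>

// Hypothetical record of what built an index; nothing stores this today.
struct IndexStamp {
    std::string indexFormat;  // library release that wrote the files, e.g. "2.3.3.4"
    std::string analyzerId;   // assumed naming scheme, e.g. "standard/1"
};

enum class IndexStatus {
    Usable,            // both fields match
    AnalyzerMismatch,  // files are readable, but hits may be missing
    FormatMismatch     // written by a different library release
};

// Compare the stamp stored with an index against the running library.
// The format check comes first because it is the more serious failure.
IndexStatus checkIndex(const IndexStamp& stored, const IndexStamp& running) {
    if (stored.indexFormat != running.indexFormat) return IndexStatus::FormatMismatch;
    if (stored.analyzerId != running.analyzerId) return IndexStatus::AnalyzerMismatch;
    return IndexStatus::Usable;
}
```

Keeping the two fields separate lets a frontend refuse to open an
index on FormatMismatch but merely suggest a rebuild on
AnalyzerMismatch.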

Again, implementing this is a near-trivial task (although
compatibility between the indexes created in C and those in Java would
probably not be possible because the Java Lucene library is much more
active than CLucene). It's simply never been a priority for anyone to
do.

--Greg

> Also of note, there have been some substantial changes to Unicode from release to release. So, if the version of Unicode used by the OS, Java, ICU, .... changes, the index may no longer be valid. From what I can tell this will mostly affect minority languages.
>
> In Him,
>         DM Smith
>
>
> On Nov 26, 2012, at 7:22 AM, Peter von Kaehne <refdoc at gmx.net> wrote:
>
>>
>>> From: David Haslam <dfhmch at googlemail.com>
>>
>>> So a similar patch would be necessary in principle to JSword ???
>>
>> No. If And Bible does not have a problem, then JSword does its job correctly.
>>
>> Peter
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page