[jsword-devel] Extending Lucene Indexes and stemming in particular

Chris Burrell christopher at burrell.me.uk
Mon Apr 21 10:24:41 MST 2014


So a few things on this.

*0- STEP's use case*
Apologies for not explaining this properly before. STEP has a facility to
allow people to search by topic/subject. We have auto-completion, meaning
there are a lot of matches. We search for topics across a couple of
different data sources (Books of the Bible, in particular the ESV, and
Nave's concordance). When a user types 'love' or 'brother' while looking
for a particular passage, they care little whether we source the data from
an underlying Bible, a Nave concordance module or some other data source,
so when STEP displays 'love' we only display one option. In order to group
options accurately across data sources in a way that minimizes the number
of options, we use the stem as the grouping factor. In pseudo-SQL terms it
would be something like: (select heading from Bible union select topHeading
from Nave) group by stem(term). In other words, we display a term found in
the index, but all its derivatives are hidden until the user elects to see
them.

[image: Inline images 1]

The different markers (H, N, N+) indicate Bible headings, Nave top-level
headings and Nave extended headings respectively. As you can see, there is
no 'lovely'.
That comes under 'love'. And there is no 'loved', that also comes under
'love'. So that's the use case I'm trying to solve.
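
To make the grouping concrete, here is a minimal sketch. It is not STEP's
code: stem() is a toy suffix-stripper standing in for a real stemmer (such
as Snowball), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StemGrouping {
    // Hypothetical toy stemmer: strips one common English suffix, then a
    // trailing 'e'. A real frontend would use a proper stemmer instead.
    static String stem(String term) {
        String t = term.toLowerCase();
        for (String suffix : new String[] {"ly", "ed", "ing", "s"}) {
            if (t.length() > suffix.length() + 2 && t.endsWith(suffix)) {
                t = t.substring(0, t.length() - suffix.length());
                break;
            }
        }
        return t.endsWith("e") ? t.substring(0, t.length() - 1) : t;
    }

    // Groups the union of terms from all data sources by their stem,
    // keeping the order in which stems were first seen.
    static Map<String, List<String>> groupByStem(List<String> terms) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String term : terms) {
            groups.computeIfAbsent(stem(term), k -> new ArrayList<>()).add(term);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Terms as they might come back from the Bible headings and Nave.
        List<String> union = List.of("love", "loved", "lovely", "brother", "brothers");
        for (List<String> group : groupByStem(union).values()) {
            // Show the first term; the derivatives stay hidden until requested.
            System.out.println(group.get(0) + " (+" + (group.size() - 1) + " hidden)");
        }
    }
}
```

The first term of each group is the one displayed; the others are the
derivatives the user can expand.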

*1- Performance penalties on Lucene 3 indexes with Lucene 4 engine. [worth
noting]*
There are various posts on the internet that suggest that the Lucene 4
engine penalises the use of Lucene 3 indexes.
http://lucene.472066.n3.nabble.com/Lucene-Index-backward-compatibility-related-question-td4003520.html
suggests that using old indexes incurs a small penalty. On the other hand,
various posts suggest a good performance improvement from using 3.6+, 4.0,
etc. So perhaps the penalty is outweighed by the rest. In particular, one
of the changes mentioned in Lucene 4 is an improvement in the minOccurs

*2- Supporting multiple Lucene versions*
I think what DM was suggesting earlier would make sense.
Step 1a- Rip out the Lucene 3 code into a separate jar file (for use in AB
going forward until AB has a suitable upgrade policy/path).
Step 1b- Ensure the interface is nice and clean.
Step 2- Write a Lucene 4 indexing capability with the same interface.
Step 3- Make this Lucene 4 implementation the new default.

I'm assuming that's what DM was suggesting?
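
As a rough sketch of how I read those steps (assumptions throughout: the
BookIndex interface, the two stand-in classes and the "lucene3"/"lucene4"
names are hypothetical and are not JSword's actual plugin interfaces):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class IndexFactory {
    // Hypothetical interface shared by both implementations (Step 1b).
    interface BookIndex {
        String luceneVersion();
    }

    // Stand-in for the Lucene 3 code moved to its own jar (Step 1a).
    static class Lucene3BookIndex implements BookIndex {
        public String luceneVersion() { return "3"; }
    }

    // Stand-in for the new Lucene 4 implementation (Step 2).
    static class Lucene4BookIndex implements BookIndex {
        public String luceneVersion() { return "4"; }
    }

    private final Map<String, Supplier<BookIndex>> implementations = new HashMap<>();
    private String defaultName;

    void register(String name, Supplier<BookIndex> impl) {
        implementations.put(name, impl);
    }

    void setDefault(String name) {
        if (!implementations.containsKey(name)) {
            throw new IllegalArgumentException("Unknown implementation: " + name);
        }
        defaultName = name;
    }

    BookIndex create(String name) {
        Supplier<BookIndex> impl = implementations.get(name);
        if (impl == null) {
            throw new IllegalArgumentException("Unknown implementation: " + name);
        }
        return impl.get();
    }

    BookIndex createDefault() {
        if (defaultName == null) {
            throw new IllegalStateException("No default implementation set");
        }
        return create(defaultName);
    }

    // Both implementations registered, Lucene 4 as the default (Step 3).
    static IndexFactory standard() {
        IndexFactory factory = new IndexFactory();
        factory.register("lucene3", Lucene3BookIndex::new);
        factory.register("lucene4", Lucene4BookIndex::new);
        factory.setDefault("lucene4");
        return factory;
    }

    public static void main(String[] args) {
        IndexFactory factory = standard();
        System.out.println("default: Lucene " + factory.createDefault().luceneVersion());
        System.out.println("AB:      Lucene " + factory.create("lucene3").luceneVersion());
    }
}
```

A frontend that cannot upgrade yet asks for the old implementation by name;
everyone else gets the default.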

*3- Supporting multiple index configurations*
This is slightly separate from sharing the same implementation, although
related. At the moment, as far as I can see, a frontend can elect to turn
some parts on or off using the IndexPolicyAdapter. However, it has no
control over how things get indexed (e.g. Analyzed, Stored, etc.). I'm not
sure that's necessarily a big blocker, so long as we accept that by default
we want everything, as configured.

There have been relatively few changes here. The changes in the last year
that I remember are: intros, morphology, heading stem, fixes to the Strong's
index field, ... others? Apart from the fix to the Strong's number indexing,
all of these changes have been backwards compatible.

*4- Shared indexes*
I think we mostly agree that it would make sense to have separate indexes
per front-end. At a minimum this will likely mean index locations change.
Having said that, we could keep the current location as the default
un-customized version. I don't think this has been an issue so far as BD is
mostly the only desktop application around.
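
A minimal sketch of what per-front-end locations could look like, keeping
the current location as the default. The directory layout, the "lucene"
base directory name and the "step" frontend name are assumptions for
illustration only, not JSword's actual paths.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class IndexLocations {
    // Hypothetical layout: <base>/<book> for the default, un-customized
    // index and <base>/<frontend>/<book> for a frontend-specific one.
    static Path indexDir(Path base, String frontend, String book) {
        Path root = (frontend == null || frontend.isEmpty()) ? base : base.resolve(frontend);
        return root.resolve(book);
    }

    public static void main(String[] args) {
        Path base = Paths.get("lucene"); // assumed base directory
        System.out.println(indexDir(base, null, "ESV"));   // default location
        System.out.println(indexDir(base, "step", "ESV")); // STEP's own index
    }
}
```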

DM>Right now, if we use the same analyzer for search and for index
construction across all fields, we can share the same indexes even if we
use different index policy adapters.
CJB> I'm not sure this is true? If BD is configured to disable headings in
the policy adapter, but STEP has headings enabled in the policy adapter,
then the index will contain different content. I.e. STEP will lack headings
if BD created the index; BD will not suffer if STEP created the index.
The former would break the auto-completion aggregation described above. Or am
I mis-understanding something?
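
To illustrate the asymmetry I mean, here is a minimal sketch. It models an
index purely as the set of field names written at build time; the field
names "content" and "heading" and the class itself are assumptions for
illustration, not the IndexPolicyAdapter API.

```java
import java.util.Set;
import java.util.TreeSet;

class SharedIndexCheck {
    // Returns the fields a frontend needs that the index on disk lacks.
    static Set<String> missingFields(Set<String> indexedFields, Set<String> requiredFields) {
        Set<String> missing = new TreeSet<>(requiredFields);
        missing.removeAll(indexedFields);
        return missing;
    }

    public static void main(String[] args) {
        Set<String> bdPolicy = Set.of("content");              // headings disabled
        Set<String> stepPolicy = Set.of("content", "heading"); // headings enabled

        // STEP reading an index that BD built: the heading field is missing.
        System.out.println("STEP on BD index:  missing " + missingFields(bdPolicy, stepPolicy));
        // BD reading an index that STEP built: nothing is missing.
        System.out.println("BD on STEP index:  missing " + missingFields(stepPolicy, bdPolicy));
    }
}
```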


Chris









On 21 April 2014 17:44, DM Smith <dmsmith at crosswire.org> wrote:

>
> On Apr 21, 2014, at 7:11 AM, Martin Denham <mjdenham at gmail.com> wrote:
>
> I don't want to be seen as a 'stick-in-the-mud' regarding index
> improvements so could I emphasize that STEP and AB requirements are very
> different and I suppose most desktop apps like BD are probably somewhere in
> the middle:
>
>
> BD is somewhere in the middle. We've always had the aim of supporting
> missionaries and pastors that have old, hand-me-down, underpowered,
> under-resourced machines. As such we support very old variants of Windows.
> We still will run on Windows 98se. And fairly old versions of Mac OSX.
> (This has held back which version of Java that we use.)
>
> We didn't implement downloadable indexes precisely because we were not
> willing to solve the versioning problem.
>
> Because of AndBible, we need to solve it. Looking forward to our using
> Sijo's improvements in this regard to identify what works and what doesn't.
>
> AndBible ultimately needs to have a plan and solution for moving forward.
> Your suggestion below seems to be reasonable.
>
>
> AB
> Indexes all over the world
> Low powered devices
> Need small indexes
> Need to have fast index generation
> Need low memory and storage requirements
> I have no direct access to devices
> Users normally have no technical experience
> Users very happy with current search functionality
> Need backward compatibility
> Download speed very slow in general 2G/3G and costs money
> Frequent connection problems depending on country, provider
>
> STEP
> Indexes at single centralized location
> High powered server
> Index size not a factor
> Index generation only occurs once
> Lots of RAM and disk space available
> Experienced dev-op (Chris)
> Regeneration simple
> Pressing for enhanced functionality
> No need for backward compatibility
> Instant access to server
>
> I realise Sijo is implementing upgrade functionality but even then it will
> still not be a simple upgrade for AB given the architecture but STEP would
> not even need to use the new upgrade code.
>
> *Questions*
> I haven't followed all of the preceding discussion, partly because the
> finer details of Lucene have beaten me, so could I ask for clarification of
> some of the changes.  There seem to be 3 changes being discussed:
>
> *New code to support index upgrades* (Sijo)
> I understand most of this.  It looks useful.  I am hoping to submit a
> simple change/suggestion for the download index method compatible with
> this.  Index upgrades should have a deprecation period when old indexes
> work but new indexes are available for download or generation.
>
>
> Yes we need to support a reasonable upgrade path.
>
>
> *Changes to the generated indexes to support different search methods*
> I got a bit lost in the detail here.  Is this to allow enhanced
> STEP-specific functionality or a required change for basic JSword searches?
> If it is for STEP, could this be handled via IndexPolicyAdapter?
>
>
> I'm not sure of the specific reason that stemming is useful for headings.
>
> Earlier, we provided a mechanism to provide alternative analysis of the
> body content per language. This allowed one to tailor analysis with regard
> to stemming and/or stop words. Or more advanced choices such as which
> method to do analysis for Chinese.
>
> When we added the capability, we did it just for the body, but not for the
> other text fields, such as headings and notes. And when intro was added it
> just took the default analysis.
>
> It appears that we need a mechanism to specify per field capabilities. The
> upshot of allowing each application to tailor how it does an index is that
> it makes sharing indexes much, much harder.
>
> Right now, if we use the same analyzer for search and for index
> construction across all fields, we can share the same indexes even if we
> use different index policy adapters. If we tailor stemming and stop words
> on a per app or user basis, then we cannot share unless we have a manifest
> that declares to an application how to build a compatible search analyzer.
> We don't have that. The Solr project has such a mechanism and we could
> model after it. But for one application to use the indexes of another it
> cannot presume that all indexes are built the same. (Which is true today
> even within an app). That is the approach that I take in Bible Desktop. A
> user can search Strong's numbers or xrefs in a module that does not have
> them. I've not gotten complaints. So I assume that it is not confusing.
>
>
> *Upgrade of Lucene*
> I realise STEP and AB have different leanings on this because of different
> architectures.  Which version of Lucene are we currently planning to
> move to? Various versions were discussed, some of which have a modified
> API and incompatible indexes, and some of which don't.  If the target
> version of Lucene is incompatible then DM's suggestion will hopefully
> work, but will it be possible to isolate API differences sufficiently to
> use the plugin architecture?
>
>
> The issue with Lucene is that we upgraded from 2.x to 3.0 and then 3.0.3.
> Up to that point Lucene had a strict compatibility policy that a major
> series had index backward compatibility from start to finish. So 2.0 to 2.9
> were index compatible. The next major version would be backward compatible.
> So 3.0 could read 2.x indexes. The only difference between 2.9 (the last
> release of a major version) and 3.0 (the first of the next major version)
> was the removal of deprecations. This allowed for a clean upgrade from one
> major version to another.
>
> But shortly after 3.0 the Lucene core committers voted to modify their
> policy, but came up with a mechanism to allow for backwards compatibility
> to a particular release. Basically, they noted that they could do a bunch
> of major version releases or they could add breaking changes within a
> major version number. They decided on the latter. Now if you want backward
> compatibility with 3.0 you'd use the o.a.l.util.Version.LUCENE_30 as an
> argument in the construction of Filters and Analyzers.
>
> Supposedly, the Lucene 4.0 release will have backward compatibility to 3.0
> if LUCENE_30 is used.
>
> If you specify LUCENE_30, then there are some features that might not be
> available.
>
> Somewhere in the 3.x series they added an entirely different internal
> architecture which would require a major re-write of our Filters and
> Analyzers. I started it a couple of times, but got distracted by other
> things each time.
>
> The 3.x releases were 3.0, 3.1, 3.2, 3.3, 3.4, 3.5 and 3.6. 3.6 is the
> terminal release of the 3.x series. The 4.0 release should be nearly
> identical to the 3.6 release with deprecations removed. I don't know that
> it makes any sense to do 3.1 to 3.5. I'm not sure that it makes sense to do
> 3.6 and release it, but it would make it easier to go from 3.0 to 4.0 to
> use that as an intermediate. Sijo will decide.
>
> It might make sense to do the most recent release in 4.x.
>
> But it should be mostly transparent to our applications and it should
> support backward compatibility.
>
> -- DM
>
>
> Regards
> Martin
>
>
>
> On 21 April 2014 04:28, Sijo Cherian <sijo.cherian at gmail.com> wrote:
>
>> Thanks DM for explaining this far. The plugin configuration is a nice way
>> to do index customization.
>> As we extend our index fields, we should make it easier for the api user
>> to see all index fields present, and analyzer used for each.
>>
>> I am working on getting lucene upgrade functionality done.
>>
>>
>> On Sun, Apr 20, 2014 at 3:22 PM, DM Smith <dmsmith at crosswire.org> wrote:
>>
>>> He is risen!
>>>
>>> I haven't pulled the push request, atm. I think we need a bit more
>>> discussion. We are close.
>>>
>>> Indexing/searching is specified via interface and implemented via
>>> plugins. The IndexManager.plugin, QueryBuilder.plugin,
>>> QueryDecorator.plugin and Searcher.plugin. AnalyzerFactory.properties that
>>> Sijo mentioned is also a critical part. There may be a few others.
>>>
>>> There is no problem with AndBible having a different Index
>>> implementation (i.e. the current one) if we create the new one with a
>>> different name. AndBible will need to have a jar with the old
>>> implementation. JSword will provide the new implementation.
>>>
>>> This plugin mechanism was provided to be able to swap out one
>>> implementation for another during development, but can serve this purpose
>>> well.
>>>
>>> DM
>>>
>>>
>>> On Apr 20, 2014, at 3:39 AM, Chris Burrell <chris at burrell.me.uk> wrote:
>>>
>>> Hi Sijo
>>>
>>> That wouldn't do what I want. I need the non stemmed body content and a
>>> separate stemmed heading field.
>>>
>>> Even if I did want the stemmed body, I would want it in addition to the
>>> non stemmed body.
>>>
>>> As I said, happy to remove the other ones. They were put in at DM's
>>> suggestion.
>>>
>>> Chris
>>>  On 20 Apr 2014 03:09, "Sijo Cherian" <sijo.cherian at gmail.com> wrote:
>>>
>>>> Chris,
>>>> Since we already have a language based Analyzer configuration, if you
>>>> can provide a custom jsword/src/main/resources/AnalyzerFactory.properties
>>>> in STEP and add custom config for english like this:
>>>>
>>>>
>>>> en.Analyzer=org.crosswire.jsword.index.lucene.analysis.ConfigurableSnowballAnalyzer
>>>>
>>>> This will stem the "content" field, both during indexing & query. Can
>>>> you override prop files in your classpath, easily?
>>>>
>>>> Regarding your requirement to stem the heading: Since the current impl
>>>> for "heading" uses the default analyzer, you will have to change prop
>>>> "Default.Analyzer" to snowball, but that will have bigger impact - uses
>>>> snowball for all other fields.
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Apr 19, 2014 at 4:14 AM, Chris Burrell <chris at burrell.me.uk> wrote:
>>>>
>>>>> I don't mind configuration so long as these indexes are stored
>>>>> separately per app.
>>>>>
>>>>> STEP relies on stemming, and in the places where it uses it we can't ask
>>>>> the user, nor would it make sense to. So things would break and be quite
>>>>> hard to debug.
>>>>> Chris
>>>>> On 19 Apr 2014 06:13, "Sijo Cherian" <sijo.cherian at gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Great discussion. isProgress.
>>>>>>
>>>>>> I am still pondering all the benefits of double indexing the entire
>>>>>> content.
>>>>>>
>>>>>> For specialized users who don't want stemming to factor into their
>>>>>> searching: can we provide an API for them to specify params like
>>>>>> noStemming, noLowercase etc. at the time of indexing on a per-book basis,
>>>>>> and persist that metadata in a property file? Then use exactly those
>>>>>> properties during query analysis.
>>>>>> These users probably won't want auto-reindexing on major jsword upgrade.
>>>>>>
>>>>>> Easter is almost here!
>>>>>> -sijo
>>>>>> On Thu, Apr 17, 2014 at 8:40 PM, DM Smith <dmsmith at crosswire.org> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Apr 17, 2014, at 12:09 PM, Chris Burrell <chris at burrell.me.uk>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hello
>>>>>>>
>>>>>>> STEP uses stemming to improve search results, in some queries
>>>>>>> (whether on Sword modules or otherwise).
>>>>>>>
>>>>>>>
>>>>>>> Stemming is very useful. On occasion, there is a need for a
>>>>>>> non-stemmed search. Especially for theological purposes. But for general
>>>>>>> purpose searching it should be the default.
>>>>>>>
>>>>>>> I've sometimes thought it'd be good to double index: stemmed and
>>>>>>> full word.
>>>>>>>
>>>>>>>
>>>>>>> There are currently 2 limitations in JSword, both of which could
>>>>>>> easily be fixed. Please let me know if you have concerns around me
>>>>>>> implementing both.
>>>>>>>
>>>>>>> a- the frontend can't extend/control the use of indexes. I'm
>>>>>>> suggesting we add a registerFieldIndexer(fieldIndexer) with a simple
>>>>>>> interface: indexField(doc, osis). This would allow a frontend to specify
>>>>>>> its own indexing, i.e. to index new things, or enable term vectors /
>>>>>>> store fields, etc.
>>>>>>>
>>>>>>>
>>>>>>> I'd really rather that we didn't go down this route. I don't mind
>>>>>>> plugin architecture as a way to experiment with different techniques, but
>>>>>>> I'd really rather that we all benefit from the changes.
>>>>>>>
>>>>>>>
>>>>>>> b- Extend the LuceneIndex to have a stemmed version of the heading.
>>>>>>> We could replace the existing index, but that would mean all frontends will
>>>>>>> require re-indexing.
>>>>>>>
>>>>>>>
>>>>>>> I think the same manner that we index the main verse text should be
>>>>>>> applied to all text: intro, heading and verse text.
>>>>>>>
>>>>>>>
>>>>>>> c- Had JSword been configured to 'STORE' the content of some fields,
>>>>>>> I would have used that for headings. For example, if the headings are stored
>>>>>>> in the index, STEP would not need to do an osis extract and XML transform
>>>>>>> to display to the user. It could come straight from the index. Two
>>>>>>> possibilities here: change the existing index field configuration, or
>>>>>>> duplicate into a different field.
>>>>>>>
>>>>>>>
>>>>>>> I think we should make store an option, possibly the standard.
>>>>>>>
>>>>>>> Right now the way we do the index prevents us from using Lucene to
>>>>>>> highlight the search hit. If that is STORE, then I'd be in favor of making
>>>>>>> STORE standard. I wonder if our stripping the text to not include OSIS
>>>>>>> before indexing will frustrate this change.
>>>>>>>
>>>>>>> It still should be an option for the sake of devices that are disk
>>>>>>> limited.
>>>>>>>
>>>>>>> d- the other side of c- is that ideally multiple headings should be
>>>>>>> stored in multiple entries to the same field, rather than a concatenation
>>>>>>> of the field (doesn't much matter if it's only ANALYZED)
>>>>>>>
>>>>>>>
>>>>>>> Some verses have headings in the middle of the verse. Don't make the
>>>>>>> mistake of assuming an order of heading. Or that heading contains only
>>>>>>> pre-verse material or all pre-verse material.
>>>>>>>
>>>>>>>
>>>>>>> *I only need one of a- or b- to be able to progress. Happy to do
>>>>>>> either. I don't need c- because I've worked around, but it would have been
>>>>>>> nice to have some control over that. *
>>>>>>>
>>>>>>> pros & cons:
>>>>>>> a- more extensible in the future, other frontends don't benefit from
>>>>>>> enhancements
>>>>>>> b- solves an immediate problem, but impacts all frontends (i.e.
>>>>>>> space used in index).
>>>>>>>
>>>>>>> The only other bit in my mind is whether we need to ensure
>>>>>>> index-cross-application compatibility. I suspect some of this will tie in
>>>>>>> with the good work that Sijo has done on index management.
>>>>>>>
>>>>>>>
>>>>>>> The index management will be more critical with such a change. I've
>>>>>>> talked about having a manifest which defines the characteristics of the
>>>>>>> index. If we share an index created by two different systems, it will be
>>>>>>> important to "know" what an index supports.
>>>>>>>
>>>>>>> One of the changes that is being worked on is the update to a more
>>>>>>> recent version of Lucene. This affects how stemming is done. The way we are
>>>>>>> doing it now is deprecated and dropped.
>>>>>>>
>>>>>>>
>>>>>>> Let me know what your preferences are.
>>>>>>>
>>>>>>>
>>>>>>> Progress not perfection. Shared, configurable changes.
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> jsword-devel mailing list
>>>>>>> jsword-devel at crosswire.org
>>>>>>> http://www.crosswire.org/mailman/listinfo/jsword-devel
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Sijo
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Sijo
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> Regards,
>> Sijo
>>
>>
>>
>
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 43228 bytes
Desc: not available
URL: <http://www.crosswire.org/pipermail/jsword-devel/attachments/20140421/d8098d33/attachment-0001.png>

