[sword-devel] Stem searching
Troy A. Griffitts
scribe at crosswire.org
Thu Jul 12 10:01:15 MST 2012
A relational database will not contribute more to a solution than what
we have available in lucene. What I failed to get across in my last
email, due to too much caffeine, was that a verse's declension data by
itself is useless without being attached to the lemma which each morph
code in the declension data modifies.
We have 2 things for each word:
root@declension
we refer to these as:
lemma@morph
Root, stem, and lemma are all synonyms in this discussion.
Currently in our Lucene index we have a field called 'lemma', so for a
verse with 4 words, this field might look something like this:
lem1 lem2 lem3 lem4
and we can search for all verses containing lem3.
Great, but this ignores the declension data; e.g., was lem3 a 1st person
or 2nd person noun? Ignoring declension is usually desired when doing
word studies, and is why we have the 'lemma' Lucene index in the first
place. You don't want to have to search for all forms of a word to do a
word study.
... but sometimes you only care about 1 form of a word when doing a
study, so how do we incorporate the declension information?
It would be useless to create a 'morph' field with contents for the same
verse like this:
mor1 mor2 mor3 mor4
In this scenario, you could construct a CLucene search using both fields,
but this would not return what you desire. This would return all verses
which have a lem2 in the lemma field and a mor2 in the morph field, but
not necessarily together.
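To make the false positive concrete, here is a toy sketch in plain Python (not Lucene). The verse data and the function name search_both_fields are invented for illustration. It models each field as an independent list of terms, which is roughly how a query ANDing two fields behaves:

```python
# Toy model of two independent per-verse fields.
# The verse contents are hypothetical, not real index data.
verses = {
    "verse_a": {"lemma": ["lem1", "lem2", "lem3"],
                "morph": ["mor1", "mor2", "mor3"]},
    # In verse_b, lem2 carries mor9 and mor2 belongs to lem4.
    "verse_b": {"lemma": ["lem2", "lem4"],
                "morph": ["mor9", "mor2"]},
}

def search_both_fields(verses, lemma, morph):
    """Return verses whose lemma field contains `lemma` AND whose
    morph field contains `morph`, regardless of word position."""
    return sorted(
        ref for ref, fields in verses.items()
        if lemma in fields["lemma"] and morph in fields["morph"]
    )

hits = search_both_fields(verses, "lem2", "mor2")
print(hits)  # ['verse_a', 'verse_b']
```

verse_b is returned even though its lem2 never carries mor2, which is exactly the "not necessarily together" problem.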
So... the proposed solution...
We have created a new field called 'morph' which will probably replace
the lemma field and has data like this:
lem1@mor1 lem2@mor2 lem3@mor3 lem4@mor4
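A field value of this shape could be assembled from per-word (lemma, morph) pairs roughly as follows. This is only a sketch: build_morph_field is a made-up helper, not part of SWORD, and it assumes the field is later tokenized on whitespace only, so that each lemma@morph pair survives as a single term.

```python
def build_morph_field(words):
    """Join (lemma, morph) pairs into one whitespace-separated field value.

    Each word becomes a single 'lemma@morph' token so the pairing
    between a lemma and its morphology code is never lost.
    """
    return " ".join(f"{lemma}@{morph}" for lemma, morph in words)

# Hypothetical 4-word verse using the placeholder names from this email.
verse_words = [("lem1", "mor1"), ("lem2", "mor2"),
               ("lem3", "mor3"), ("lem4", "mor4")]

field_value = build_morph_field(verse_words)
print(field_value)  # lem1@mor1 lem2@mor2 lem3@mor3 lem4@mor4
```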
This allows a Lucene search to be created like this:
morph:lem2@mor2
or, to get the functionality of the current 'lemma' field, which ignores
declension, the equivalent search using the 'morph' field would be:
This allows all kinds of queries, like: give me all verses which have
lem1 and lem2 within 4 words of each other, and lem2 must have the
morphology mor2:
morph:"lem1@* lem2@mor2"~4
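The three kinds of lookup can be imitated in plain Python to show what each is meant to match. This is a rough sketch, not Lucene: positions and within are invented helpers, shell-style wildcards from the standard fnmatch module stand in for Lucene's `*`, and "within N words" is simplified to a position difference of at most N, which only approximates Lucene's slop.

```python
from fnmatch import fnmatchcase

def positions(tokens, pattern):
    """Indexes of tokens matching a wildcard pattern such as 'lem2@*'."""
    return [i for i, tok in enumerate(tokens) if fnmatchcase(tok, pattern)]

def within(tokens, pattern_a, pattern_b, max_distance):
    """True if a token matching pattern_a lies at most max_distance
    positions away from a different token matching pattern_b."""
    return any(
        a != b and abs(a - b) <= max_distance
        for a in positions(tokens, pattern_a)
        for b in positions(tokens, pattern_b)
    )

# Hypothetical verses; in `other`, lem2 carries a different morph code.
verse = "lem1@mor1 lem2@mor2 lem3@mor3 lem4@mor4".split()
other = "lem1@mor1 lem2@mor7 lem3@mor3 lem4@mor4".split()

exact = positions(verse, "lem2@mor2")       # lemma with a specific morph
lemma_only = positions(other, "lem2@*")     # lemma with any morph
near = within(verse, "lem1@*", "lem2@mor2", 4)
near_other = within(other, "lem1@*", "lem2@mor2", 4)

print(exact, lemma_only, near, near_other)  # [1] [1] True False
```

The wildcard form `lem2@*` plays the role of the old 'lemma' field, since it matches the lemma whatever morph code follows it.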
Hope this makes things clearer if there were any clouds :)
On 07/12/2012 02:17 PM, Chris Burrell wrote:
> Thanks Troy. That helps put the task in perspective... An alternative
> would possibly be to store both strong and morphology indexes in a
> relational database. Then have a table mapping all the data together.
> I guess the mapping table would be based on one version of the Bible
> On 11 July 2012 01:09, Troy A. Griffitts <scribe at crosswire.org
> <mailto:scribe at crosswire.org>> wrote:
> We've toyed around with the best way to add lemma+morph searching
> in SWORD but haven't finalized anything yet.
> Indexing morphology codes won't help. This would give you 2
> fields which need to be used together.
> For example, if you wish to find ????? only in the nominative
> within 3 words of any present, active, indicative, 2nd person
> singular or plural verb, you could not satisfy your search.
> Believe it or not, end users of tools like Bibleworks seem quite
> happy to learn odd syntax like:
> "?????@* *@PAI2?"~3
> Of course GUI tools to help build that syntax for them is also
> This is the direction we're heading, but it would require the lemma
> encoding to be changed from Strong's numbers to lexical form.
> Presently we could nearly obtain this by building an index as
> (from the start of John 1.1):
> G1722@PREP G746@N-DSF G2258@V-IXI-3S
> But this would require users to know Strong's numbers rather than
> lexical form, which would almost certainly need a GUI to help them
> build the search syntax.
> Hope this helps,
> On 07/10/2012 11:41 PM, Chris Burrell wrote:
>> Has anyone tried some kind of stem search with JSword? Is
>> it implemented? Or would we need to do a bit more work there?
>> jsword-devel mailing list
>> jsword-devel at crosswire.org <mailto:jsword-devel at crosswire.org>