[sword-devel] Lexical fields
chris at burrell.me.uk
Thu Aug 22 13:14:29 MST 2013
For our lexicon, we don't use SWORD modules because they aren't flexible
enough. The main drawback was the lack of separation between the different
parts of the data. I'm not an expert, but I don't think current OSIS would
let you represent what I attached in the previous files in a way that lets
you retrieve each part separately. We also wanted control over how the
indexing happens.
On the other hand, STEP's datasets are all built on Lucene, so there's no
reason why a new 'flexible' Sword module format couldn't be created. Lucene
distinguishes between Stored and Analyzed fields. So for accents, we strip
the accents off and index the non-accented version (but don't store it).
Then when it comes to searching, all you need to do is strip the accents
off the input. We also keep the accented version for the use cases that
need it, and in that case it is stored, because it is displayed to the user.
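
As a rough illustration of the accent handling described above, here is a
plain-Python sketch. It is not Lucene or STEP code: TinyIndex and
strip_accents are names made up for the example, and the two entries are
only sample data. The folded form is used for matching but never kept for
display; the accented form is kept only for display.

```python
import unicodedata
from collections import defaultdict


def strip_accents(text):
    """Decompose to NFD, drop combining marks (accents, breathings), lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare).lower()


class TinyIndex:
    """Toy stand-in for an index (hypothetical, not Lucene's API)."""

    def __init__(self):
        # folded term -> entry keys: "indexed but not stored"
        self.postings = defaultdict(set)
        # entry key -> accented form: "stored", because it is shown to the user
        self.stored = {}

    def add(self, key, accented_form):
        self.stored[key] = accented_form
        self.postings[strip_accents(accented_form)].add(key)

    def search(self, user_input):
        # The query goes through the same folding as the indexed text.
        keys = self.postings.get(strip_accents(user_input), set())
        return [self.stored[k] for k in sorted(keys)]


index = TinyIndex()
index.add("G3056", "λόγος")
index.add("G0026", "ἀγάπη")
hits = index.search("αγαπη")  # unaccented input still finds the accented entry
```

The point of the design is that the same folding function runs at index
time and at query time, so the user can type with or without accents.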
Non-unique fields are very easy in Lucene. For example, we have the
relatedStrongs field, which we generated by looking at many different
dictionaries and picking out some of the Strong's numbers mentioned in the
same entries, with a scholar at Tyndale then sanitizing/checking the
result. So this isn't quite lexically related, although in most cases it
is. You can also store multiple values in your Lucene fields.
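
To make the idea of non-unique, multi-valued fields concrete, here is a
small plain-Python sketch. Again this is not Lucene's API: FieldIndex is a
made-up class, "lexicalForm" is a hypothetical field name, and the values
are sample data rather than the actual STEP dataset. Only relatedStrongs
is a field name taken from our data.

```python
from collections import defaultdict


class FieldIndex:
    """Toy index where any field can be looked up, a field may hold several
    values, and the same value may occur in many entries (hypothetical)."""

    def __init__(self):
        # field name -> value -> set of entry keys
        self.by_field = defaultdict(lambda: defaultdict(set))

    def add(self, key, fields):
        for name, values in fields.items():
            if isinstance(values, str):
                values = [values]  # treat a single value as a one-item list
            for value in values:
                self.by_field[name][value].add(key)

    def lookup(self, field, value):
        # .get() avoids creating empty buckets for unknown fields or values
        return sorted(self.by_field.get(field, {}).get(value, set()))


index = FieldIndex()
index.add("G0025", {"lexicalForm": "ἀγαπάω", "relatedStrongs": ["G0026", "G0027"]})
index.add("G0026", {"lexicalForm": "ἀγάπη", "relatedStrongs": ["G0025", "G0027"]})
related = index.lookup("relatedStrongs", "G0027")  # both entries match
```

A "root" field would work the same way: several entries share one value,
and a lookup on that value returns all of them.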
A current Sword module has two components: the data files + the index. The
index is used for searching, and the data files for retrieving the document
content, generally at a key level (although I believe hierarchical keys are
supported). In a lot of the STEP datasets, we ship the data files, but
don't use them after building the index, since Lucene has by then stored the
relevant fields and analyzed the others (you can tell Lucene how to analyze
your particular fields). We use Lucene for both the data and the index.
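
A minimal sketch of that "index is also the data store" arrangement, in
plain Python rather than Lucene. Everything here is an assumption made for
illustration: the SCHEMA layout, the field names (strongNumber, gloss,
notes), the two analyzers, and the sample records. Each field says whether
it is stored and how it is analyzed; once the index is built, the source
records are no longer needed.

```python
import re
from collections import defaultdict


def keyword(value):
    """Analyzer: the whole value is one case-insensitive token."""
    return [value.upper()]


def words(value):
    """Analyzer: lowercase word tokens."""
    return re.findall(r"\w+", value.lower())


# Hypothetical per-field configuration: field -> (stored?, analyzer or None)
SCHEMA = {
    "strongNumber": (True, keyword),
    "gloss": (True, words),
    "notes": (False, words),  # searchable, but not retrievable afterwards
}


def build(records, schema):
    postings = defaultdict(set)  # (field, token) -> document numbers
    stored = []                  # document number -> stored fields only
    for doc_id, record in enumerate(records):
        kept = {}
        for field, value in record.items():
            is_stored, analyzer = schema[field]
            if is_stored:
                kept[field] = value
            if analyzer is not None:
                for token in analyzer(value):
                    postings[(field, token)].add(doc_id)
        stored.append(kept)
    return postings, stored


def search(postings, stored, schema, field, query):
    _, analyzer = schema[field]
    tokens = analyzer(query)  # analyze the query like the indexed text
    if not tokens:
        return []
    matches = set.intersection(*(postings.get((field, t), set()) for t in tokens))
    return [stored[i] for i in sorted(matches)]
```

After build() returns, everything needed for both searching and display
lives in the two returned structures, which mirrors how we stop using the
shipped data files once the Lucene index exists.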
So a similar concept would be very easy to integrate into the (J)Sword
engine. (In terms of the engine, I know JSword, not Sword.)
On 22 August 2013 10:35, Timothy S. Nelson <wayland at wayland.id.au> wrote:
> On Mon, 19 Aug 2013, Chris Burrell wrote:
>> Tyndale House (Cambridge) have devised an automatic way of
>> transliterating both Greek and Hebrew with syllable markers in the
>> Hebrew. STEP uses this as part of searches and auto completes as well as
>> interlinears. The scheme is still in beta and being refined but shout if
>> you want the code and I can point you to it. Or shout if you want
> Nice! Not at the moment, but I may ask someday.
>> I also have a mapping of all the strongs and all possible forms found in
>> the Hebrew and Greek texts. That was based on existing Sword modules.
> That sounds pretty cool too.
>> Tyndale have created a lexicon with lexical forms attached to every
>> strong number based on LSJ and BDB and some others. Actually the number
>> of fields we have is a lot more than the ones you suggest. Couple of
>> samples attached. (Transliterations aren't in the files as we generate
>> those as we index the data.)
>> Let me know if you're interested.
> It looks very interesting. My question is, does SWORD cope with
> all of these fields? Can words be looked up by each of them? It seems to
> me that SWORD can cope with only one of the fields being the key, and the
> others can't be. I think it would be useful, for example, to have
> dictionaries with multiple keys, so that items can be looked up by lexical
> item or Strongs Number. It might also mean that it would be possible to
> have fields with and without breathings, accents, and the like, which could
> be useful for searching purposes.
> It might also be useful to be able to index words on non-unique
> fields. For example, if there were a "root" field, there might be multiple
> words derived from the same root, and someone might want to do a search for
> all of them.
> Does anyone know whether these kinds of things are possible?