[sword-devel] imp2ld and alphabetization

Troy A. Griffitts scribe at crosswire.org
Sun Oct 28 22:00:48 MST 2007


Yes, everyone is correct that the .next() method on a Lexicon/Dictionary 
module will show the next value in the index-- not necessarily the next 
value alphabetized in any humanly useful order.

The purpose for the index is fast lookups.
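
As a rough illustration of that trade-off (a sketch only, not SWORD 
code; the key list and the lookup() helper are made up, and it assumes 
the index is ordered by raw UTF-8 bytes), keeping keys sorted by bytes 
is what allows a binary search, and it is also why stepping through the 
index does not match dictionary order:

```python
import bisect

# Hypothetical dictionary keys; "\u00e9clair" is "eclair" with an acute e.
keys = ["zebra", "apple", "Zion", "\u00e9clair", "echo"]

# Assumption: the index is kept sorted by raw UTF-8 bytes.
index = sorted(k.encode("utf-8") for k in keys)

def lookup(index, key):
    """Binary-search the byte-sorted index; return the position or -1."""
    raw = key.encode("utf-8")
    pos = bisect.bisect_left(index, raw)
    return pos if pos < len(index) and index[pos] == raw else -1

# Walking the index in order: uppercase sorts first, non-ASCII sorts last.
index_order = [raw.decode("utf-8") for raw in index]
print(index_order)
print(lookup(index, "echo"))
```

The lookup is fast, but a reader would not expect "Zion" before "apple" 
or the accented entry after "zebra".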

We have a few issues to solve here and DM and others have given good 
suggestions.

One solution to one small part of the problem, at least for frontends 
that load the entire index: sort the keys however you want using ICU or 
whatever Unicode/localization tools your toolkit provides.
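
A minimal sketch of that frontend-side sort, in Python. The 
collation_key() function here is a hypothetical and deliberately crude 
stand-in for a real collator: it only strips combining marks and folds 
case, which is far less than ICU does for a given locale. It assumes 
the frontend already holds the full key list in memory.

```python
import unicodedata

def collation_key(term):
    """Crude stand-in for a collator: compare base letters, ignoring case
    and accents, then fall back to the exact string to break ties."""
    decomposed = unicodedata.normalize("NFD", term)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), term)

# Keys as they come out of the index (byte order); "\u00e9" is e-acute.
keys = ["Zion", "apple", "echo", "zebra", "\u00e9clair"]

# Order for display only; the module's own index is left untouched.
display_order = sorted(keys, key=collation_key)
print(display_order)
```

With a real toolkit, the key function would be replaced by the 
collator's own sort key for the user's locale.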

Retaining the import order isn't necessarily straightforward.  The 
SWORD API exposes a dynamic modification interface which allows deletes 
and inserts at any time.  Technically, entries can be added, removed, 
modified, etc. to update and maintain any module.  Practically, this 
doesn't usually happen (we just import with a tool once and never modify 
again), but with some of the new community editing projects and tools in 
the works, this may become more common.

Generating a secondary index on a lexdict which preserves some other 
order and alternate key is a great idea and an easy addition to the 
current code.
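
To make the idea concrete, here is a toy Python sketch, not the rawstr 
format or any existing SWORD class. LexDictSketch and its fields are 
hypothetical, and the policy that a re-added entry goes to the end of 
the import order is an assumption. It keeps a byte-sorted primary index 
for lookups next to a secondary list that preserves import order, and 
keeps both consistent across inserts and deletes:

```python
import bisect

class LexDictSketch:
    """Toy lexicon: sorted primary index plus an import-order index."""

    def __init__(self):
        self.entries = {}       # key -> entry text
        self.primary = []       # keys as UTF-8 bytes, sorted, for lookup
        self.import_order = []  # keys in the order they were added

    def add(self, key, text):
        # Only new keys touch the indexes; existing keys are just updated.
        if key not in self.entries:
            bisect.insort(self.primary, key.encode("utf-8"))
            self.import_order.append(key)
        self.entries[key] = text

    def remove(self, key):
        if key in self.entries:
            del self.entries[key]
            raw = key.encode("utf-8")
            self.primary.pop(bisect.bisect_left(self.primary, raw))
            self.import_order.remove(key)

d = LexDictSketch()
for k in ["beta", "alpha", "gamma"]:
    d.add(k, "entry for " + k)
d.remove("alpha")
d.add("alpha", "revised entry")  # assumption: re-added keys go last

print(d.import_order)
print(d.primary)
```

The secondary list could just as well hold an alternate key per entry 
instead of the primary key itself.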

I am not in favor of using Lucene for any core functionality, as it 
would become a mandatory dependency, which is not practical on all 
platforms.  We can easily implement the same thing with our rawstr 
index without incurring this penalty.

This is a good item to consider for 1.5.11.

	-Troy.


DM Smith wrote:
> I'm not sure if I am reading the Sword code correctly, but it appears  
> that it is sorting at a byte level and not a character level. That  
> isn't by code points.
> 
> I think that we discussed this a little bit ago and concluded that  
> some work needs to be done in the engine.
> 
> Here is my thought on the matter, for what it is worth. Today the sort  
> serves two purposes: order and search. But it is search that  
> constrains the order to be as it is. I think that if we could search  
> independently of the order of keys in the module that would be ideal.
> 
> One simple way for any application to provide this is to create a  
> Lucene index for the dictionary (similar to what we do for a Bible)  
> that consists of the term (stored and indexed), the offset (stored)  
> in the module (so it can be retrieved and previous and next indexes  
> can be found), and the entry for the term (indexed, but not stored).  
> The application can then create any kind of collation of the keys  
> (using the excellent facilities of ICU) that suits the user's needs.  
> Then, using this double handle, present the keys in part (as in  
> BibleCS) or whole (as in BibleDesktop, MacSword, ...) in the order  
> that the user expects.
> 
> There are some related problems to this:
> A user may expect to be able to find a Hebrew word in a Hebrew  
> dictionary independent of the pointing of the word in the dictionary.  
> (i.e. a user may wish to search without specifying accents)
> A user may expect to find a word by stem not just by prefix.
> A user may expect to be able to type "photos" (a transliteration) and  
> find the real Greek word in a Greek dictionary.
> 
> I'm cross-posting to J-Sword because this will be of interest there  
> as well.
> 
> In His Service,
> 	DM Smith
> 
> 
> On Oct 28, 2007, at 9:13 PM, Frank wrote:
> 
>> peter wrote:
>>> Is this really only a Vietnamese problem, or will not all latinate
>>> scripts with extra signs have exactly the same problem?
>>>
>>> Or actually all scripts which are treated as derived scripts -
>>> Farsi, Urdu and Malay from Arabic, Tajik, Uzbek, Azeri from Russian
>>> etc - the code points are initially for the "main" characters and
>>> then there is always a bunch of extra characters which are used only
>>> in one or other language.
>>>
>>> But maybe I am just showing my ignorance here. I need to look at some
>>> dictionaries - never had any installed.
>> Any language that uses letters outside the ASCII range will be
>> affected unless they collate the letter after "z"... and if it's
>> strictly in Unicode point order, then all upper case will collate
>> before lower case...
>>
>> -- 
>> Blessings
>>
>> Frank
>>
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
> 
> 



