[sword-devel] imp2ld and alphabetization

DM Smith dmsmith555 at yahoo.com
Mon Oct 29 06:12:59 MST 2007


I have entered this into our "bugs" database:
http://www.crosswire.org/bugs/browse/API-91

Serving Him Together,
	DM

On Oct 29, 2007, at 1:00 AM, Troy A. Griffitts wrote:

> Yes, everyone is correct that the .next() method on a
> Lexicon/Dictionary module will show the next value in the index-- not
> necessarily the next value alphabetized in any humanly useful order.
>
> The purpose for the index is fast lookups.
>
> We have a few issues to solve here and DM and others have given good
> suggestions.
>
> One solution to one small part of the problem, at least for frontends
> that load the entire index: sort the keys however you want using ICU or
> whatever Unicode/localization tools your toolkit provides.
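
A rough sketch of that frontend-side sort, for illustration only. Python's
standard library has no ICU binding, so the hypothetical collation_key()
helper below merely approximates a multi-level collation key with
unicodedata; a real frontend would use ICU or its toolkit's collator. The
sample keys are made up.

```python
import unicodedata


def collation_key(term):
    """Approximate a three-level collation key (hypothetical helper, not SWORD API)."""
    decomposed = unicodedata.normalize("NFD", term)
    # Primary level: base letters only, ignoring accents and case.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Secondary level: accents break ties; tertiary level: the original spelling.
    return (base.casefold(), decomposed.casefold(), term)


# Made-up keys, normalized to NFC so the code point comparison is well defined.
keys = [unicodedata.normalize("NFC", k)
        for k in ["zebra", "Zion", "éclair", "apple", "Eden", "ấm", "an"]]

by_code_point = sorted(keys)                    # what a raw index order looks like
by_collation = sorted(keys, key=collation_key)  # closer to what a reader expects

print(by_code_point)
print(by_collation)
```

In plain code point order the capitalized and accented keys are pushed to
the ends of the list; with the approximate key they sort next to their
unaccented, lower-case neighbours.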
>
> Retaining the import order isn't necessarily straightforward.  The
> SWORD API exposes a dynamic modification interface which allows
> deletes and inserts at any time.  Technically, entries can be added,
> removed, modified, etc. to update and maintain any module.  Practically,
> this doesn't usually happen (we just import with a tool once and never
> modify again), but with some of the new community editing projects and
> tools in the works, this may become a more common event.
>
> Generating a secondary index on a lexdict which preserves some other
> order and alternate key is a great idea and an easy addition to the
> current code.
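
One way to picture such a secondary index, as a minimal sketch only. The
LexDictSketch class and all of its method names are hypothetical and do not
come from the SWORD API; a dict stands in for the module's data file. The
byte-ordered list serves fast lookups, and a second list keeps import order
in step with inserts and deletes.

```python
import bisect


class LexDictSketch:
    """Toy lexicon: byte-ordered lookup index plus a secondary order index."""

    def __init__(self):
        self._index = []    # UTF-8 keys kept in byte order for binary search
        self._order = []    # secondary index: the same keys in import order
        self._entries = {}  # stands in for the module's data file

    def insert(self, key, text):
        raw = key.encode("utf-8")
        if raw not in self._entries:
            bisect.insort(self._index, raw)  # keep the lookup index sorted
            self._order.append(raw)          # new keys go last in import order
        self._entries[raw] = text

    def delete(self, key):
        raw = key.encode("utf-8")
        if raw in self._entries:
            del self._entries[raw]
            self._index.remove(raw)
            self._order.remove(raw)

    def lookup(self, key):
        raw = key.encode("utf-8")
        pos = bisect.bisect_left(self._index, raw)  # binary search by bytes
        if pos < len(self._index) and self._index[pos] == raw:
            return self._entries[raw]
        return None

    def keys_in_index_order(self):
        return [raw.decode("utf-8") for raw in self._index]

    def keys_in_import_order(self):
        return [raw.decode("utf-8") for raw in self._order]


lex = LexDictSketch()
for word in ["zebra", "\u00e9clair", "apple", "Zion"]:  # \u00e9 is e-acute
    lex.insert(word, "entry for " + word)
lex.delete("apple")
lex.insert("an", "entry for an")

print(lex.keys_in_index_order())
print(lex.keys_in_import_order())
```

The two lists hold the same keys; only the lookup list is constrained to
byte order, so the second one is free to follow import order or any other
order the module author wants.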
>
> I am not in favor of using Lucene for any core functionality as it
> would add a mandatory dependency, which is not practical on all
> platforms.  We can easily implement the same thing with our rawstr
> index without incurring this penalty.
>
> This is a good item to consider for 1.5.11
>
> 	-Troy.
>
>
> DM Smith wrote:
>> I'm not sure if I am reading the Sword code correctly, but it appears
>> that it is sorting at a byte level and not a character level. That
>> isn't by code points.
>>
>> I think that we discussed this a little bit ago and concluded that
>> some work needs to be done in the engine.
>>
>> Here is my thought on the matter, for what it is worth. Today the sort
>> serves two purposes: order and search. But it is search that
>> constrains the order to be as it is. I think that if we could search
>> independently of the order of keys in the module that would be ideal.
>>
>> One simple way for any application to provide this is to create a
>> Lucene index for the dictionary (similar to what we do for a Bible)
>> that consists of the term (stored and indexed), the offset in the
>> module (stored, so the entry can be retrieved and the previous and
>> next entries can be found), and the entry for the term (indexed, but
>> not stored). The application can then create any kind of collation of
>> the keys (using the excellent facilities of ICU) that suits the user's
>> needs. Then, using this double handle, it can present the keys in part
>> (as in BibleCS) or whole (as in BibleDesktop, MacSword, ...) in the
>> order that the user expects.
>>
>> There are some related problems to this:
>> - A user may expect to be able to find a Hebrew word in a Hebrew
>>   dictionary independent of the pointing of the word in the dictionary
>>   (i.e. a user may wish to search without specifying accents).
>> - A user may expect to find a word by stem, not just by prefix.
>> - A user may expect to be able to type "photos" (a transliteration) and
>>   find the real Greek word in a Greek dictionary.
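
The accent-insensitive and transliterated lookups could be sketched roughly
as below. This is not Lucene and not SWORD code: a plain dict stands in for
the search index, the offsets are invented, and fold(), transliterate(),
build_search_index() and the tiny GREEK_TO_LATIN table are all hypothetical
helpers. Stemming is left out.

```python
import unicodedata


def fold(term):
    """Strip combining marks (accents, pointing) and case differences."""
    decomposed = unicodedata.normalize("NFD", term)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return bare.casefold()


# Deliberately tiny, illustrative table; a real one would cover the alphabet.
GREEK_TO_LATIN = {"φ": "ph", "ω": "o", "τ": "t", "ο": "o",
                  "σ": "s", "ς": "s", "λ": "l", "γ": "g"}


def transliterate(term):
    """Map a folded Greek term to a rough Latin transliteration."""
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in fold(term))


def build_search_index(entries):
    """Map each search form to the (term, offset) pairs it should find."""
    index = {}
    for term, offset in entries:
        # A set avoids indexing the same pair twice when both forms coincide.
        for form in {fold(term), transliterate(term)}:
            index.setdefault(form, []).append((term, offset))
    return index


def search(index, query):
    return index.get(fold(query), [])


# Invented sample data: (term, offset of the entry in the module).
entries = [("φωτός", 0), ("λόγος", 128)]
index = build_search_index(entries)

print(search(index, "photos"))  # transliterated query
print(search(index, "φωτος"))   # query typed without the accent
print(search(index, "LOGOS"))   # case is folded as well
```

The stored offset is what lets the application go back to the module for
the entry itself, so the key order in the module no longer has to match the
order or spelling the user searches with.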
>>
>> I'm cross-posting to J-Sword because this will be of interest there
>> as well.
>>
>> In His Service,
>> 	DM Smith
>>
>>
>> On Oct 28, 2007, at 9:13 PM, Frank wrote:
>>
>>> peter wrote:
>>>> Is this really only a Vietnamese problem? Will not all Latinate
>>>> scripts with extra signs have exactly the same problem?
>>>>
>>>> Or actually all scripts which are treated as derived scripts -
>>>> Farsi, Urdu and Malay from Arabic; Tajik, Uzbek, Azeri from Russian
>>>> etc. - the code points are initially for the "main" characters and
>>>> then there is always a bunch of extra characters which are used only
>>>> in one or other language.
>>>>
>>>> But maybe I am just showing my ignorance here. I need to look at
>>>> some dictionaries - never had any installed.
>>> Any language that uses letters outside the ASCII range will be
>>> affected unless it collates the letter after "z"... and if it's
>>> strictly in Unicode code point order, then all upper case will
>>> collate before lower case...
>>>
>>> -- 
>>> Blessings
>>>
>>> Frank
>>>
>>>
>>> _______________________________________________
>>> sword-devel mailing list: sword-devel at crosswire.org
>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>> Instructions to unsubscribe/change your settings at above page
>>
>>
>
>