[sword-devel] imp2ld and alphabetization

Sun Oct 28 19:57:10 MST 2007

I'm not sure if I am reading the Sword code correctly, but it appears
that it is sorting at a byte level and not a character level. That
is, it isn't sorting by code points.

I think that we discussed this a little bit ago and concluded that  
some work needs to be done in the engine.
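
To make the ordering problem concrete, here is a minimal Python sketch (not Sword code; the keys are made up for illustration). It sorts a few keys by code point and by their UTF-8 bytes. For UTF-8 text whose bytes are compared as unsigned values the two orders coincide, and neither one is what a dictionary user would expect:

```python
# Hypothetical dictionary keys, chosen only for illustration.
keys = ["zebra", "Zion", "apple", "éclair", "Adam"]

# Python compares str values code point by code point.
by_code_point = sorted(keys)

# Compare the raw UTF-8 bytes instead (bytes compare as unsigned values).
by_utf8_bytes = sorted(keys, key=lambda k: k.encode("utf-8"))

print(by_code_point)   # ['Adam', 'Zion', 'apple', 'zebra', 'éclair']
print(by_utf8_bytes)   # same order as above
```

All upper case keys come before all lower case keys, and "éclair" lands after "zebra" rather than next to the other "e" words.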

Here is my thought on the matter, for what it is worth. Today the sort  
serves two purposes: order and search. But it is search that  
constrains the order to be as it is. I think that if we could search  
independently of the order of keys in the module that would be ideal.

One simple way for any application to provide this is to create a
Lucene index for the dictionary (similar to what we do for a Bible)
that consists of the term (stored and indexed), the offset in the
module (stored, so the entry can be retrieved and the previous and
next indexes can be found), and the entry for the term (indexed, but
not stored). The application can then create any kind of collation of
the keys (using the excellent facilities of ICU) that suits the
user's needs. Then, using this double handle, it can present the keys
in part (as in BibleCS) or in whole (as in BibleDesktop, MacSword,
...) in the order that the user expects.
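
A rough sketch of the "double handle" idea, in Python with only the standard library. This is not Lucene or ICU: the list of IndexedEntry records stands in for the index, and rough_sort_key is a crude stand-in for a real ICU collation key (it only ignores accents and case at the first level). All names here are hypothetical.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedEntry:
    term: str    # the key as it appears in the module
    offset: int  # position of the entry in the module (the second handle)


def rough_sort_key(term):
    """Crude stand-in for an ICU collation key (an assumption, not ICU)."""
    decomposed = unicodedata.normalize("NFD", term)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Base letters decide first; the decomposed form only breaks ties.
    return (base.casefold(), decomposed)


def build_index(module_keys):
    # module_keys is assumed to be in the module's own (binary) order.
    return [IndexedEntry(term, offset) for offset, term in enumerate(module_keys)]


def presentation_order(index):
    # The module order stays untouched; only the presented list is re-sorted.
    return sorted(index, key=lambda entry: rough_sort_key(entry.term))


module_keys = ["Adam", "Zion", "apple", "zebra", "éclair"]
ordered = presentation_order(build_index(module_keys))
print([entry.term for entry in ordered])
# ['Adam', 'apple', 'éclair', 'zebra', 'Zion']
print([entry.offset for entry in ordered])
# [0, 2, 4, 3, 1]
```

The application shows the terms in the collated order and uses the stored offset to fetch the entry, so the module itself never has to be re-sorted.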

There are some related problems to this:
A user may expect to be able to find a Hebrew word in a Hebrew  
dictionary independent of the pointing of the word in the dictionary.  
(i.e. a user may wish to search without specifying accents)
A user may expect to find a word by stem not just by prefix.
A user may expect to be able to type "photos" (a transliteration) and  
find the real Greek word in a Greek dictionary.
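
Two of these (unpointed search and transliteration) could be handled by indexing extra search keys for each term. A minimal Python sketch, assuming the module keys are Unicode strings; the transliteration table is a tiny hypothetical one that covers only the letters in the example, and stemming is not attempted here:

```python
import unicodedata


def strip_marks(text):
    """Drop combining marks (Greek accents, Hebrew points) after decomposing."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Hypothetical, incomplete Greek-to-Latin table: phi, omega, tau, omicron,
# final sigma, sigma.
GREEK_TO_LATIN = {
    "\u03c6": "ph", "\u03c9": "o", "\u03c4": "t",
    "\u03bf": "o", "\u03c2": "s", "\u03c3": "s",
}


def transliterate(text):
    return "".join(GREEK_TO_LATIN.get(c, c) for c in strip_marks(text).lower())


def build_lookup(terms):
    """Map every alternate search key to the real dictionary terms."""
    lookup = {}
    for term in terms:
        for key in {strip_marks(term), transliterate(term)}:
            lookup.setdefault(key, []).append(term)
    return lookup


greek = "\u03c6\u03c9\u03c4\u03cc\u03c2"             # Greek word with an accent
hebrew = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"  # Hebrew word with points
unpointed = "\u05e9\u05dc\u05d5\u05dd"                # same word, no points

lookup = build_lookup([greek, hebrew])
print(lookup["photos"] == [greek])     # True
print(lookup[unpointed] == [hebrew])   # True
```

In a Lucene index the same idea would presumably be an extra indexed (not stored) field per alternate key, but that mapping is an assumption on my part.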

I'm cross-posting to J-Sword because this will be of interest there  
as well.

In His Service,
	DM Smith

On Oct 28, 2007, at 9:13 PM, Frank wrote:

> peter wrote:
>> Is this really only a Vietnamese problem, or will not all Latinate
>> scripts with extra signs have exactly the same problem?
>>
>> Or actually all scripts which are treated as derived scripts -
>> Farsi, Urdu and Malay from Arabic; Tajik, Uzbek, Azeri from Russian,
>> etc. - the code points are initially for the "main" characters and
>> then there is always a bunch of extra characters which are used only
>> in one or another language.
>>
>> But maybe I am just showing my ignorance here. I need to look at some
>> dictionaries - never had any installed.
> Any language that uses letters outside the ASCII range will be
> affected unless they collate the letter after "z"... and if it's
> strictly in Unicode code point order, then all upper case will
> collate before lower case...
>
> -- 
> Blessings
>
> Frank