[sword-devel] imp2ld and alphabetization
dmsmith555 at yahoo.com
Mon Oct 29 06:12:59 MST 2007
I have entered this into our "bugs" database: http://
Serving Him Together,
On Oct 29, 2007, at 1:00 AM, Troy A. Griffitts wrote:
> Yes, everyone is correct that the .next() method on a Lexicon/
> module will show the next value in the index-- not necessarily the
> value alphabetized in any humanly useful order.
> The purpose for the index is fast lookups.
> We have a few issues to solve here, and DM and others have given good
> solutions to one small part of the problem, at least for frontends that
> load the entire index: sort the keys however you want using ICU or
> whatever Unicode/localization tools your toolkit provides.
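For a frontend that loads the whole index, the re-sort could look roughly like the sketch below. It is only an illustration: it uses Python's standard unicodedata module as a stand-in for ICU, and the function name collation_key and the sample keys are made up for the example. A real frontend would use a locale-aware collator instead.

```python
import unicodedata

def collation_key(term):
    """Rough, locale-neutral sort key (an assumption, not ICU collation).

    Primary level: base letters only, case-folded.
    Tie-break: the original string, so distinct keys keep a stable order.
    """
    decomposed = unicodedata.normalize("NFD", term)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), term)

# Sample keys; \u00c9 is E-acute and \u00e1 is a-acute.
keys = ["zebra", "\u00c9den", "apple", "eden", "Zion", "\u00e1gua"]

# Accented and upper-case keys now sort next to their plain counterparts.
by_collation = sorted(keys, key=collation_key)
```

The key only folds accents and case, so it will not match the rules of any particular language (Vietnamese, for example, orders its tone marks in a specific way). That is the part ICU would supply.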
> Retaining the import order isn't necessarily straightforward. The
> SWORD API exposes a dynamic modification interface which allows removal
> and insertion at any time. Technically, entries can be added, removed,
> modified, etc. to update and maintain any module. Practically, this
> doesn't usually happen (we just import with a tool once and never
> again), but with some of the new community editing projects and tools
> in the works, this may be a more common event.
> Generating a secondary index on a lexdict which preserves some other
> order and alternate key is a great idea and an easy addition to the
> current code.
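One way to picture a secondary index that survives inserts and removals is sketched below. This is not the SWORD rawstr format or API. LexIndex and all of its methods are hypothetical names, and the whole thing lives in memory purely to show the bookkeeping: a byte-ordered primary index for fast lookup, plus a second list that records import order.

```python
import bisect

class LexIndex:
    """Hypothetical in-memory stand-in for a lexdict index (not SWORD code)."""

    def __init__(self):
        self._sorted_keys = []   # primary index: UTF-8 byte order, for lookup
        self._offsets = {}       # key -> offset of the entry in the data file
        self._import_order = []  # secondary index: keys in the order imported

    def insert(self, key, offset):
        if key in self._offsets:
            # A modified entry keeps its place in both indexes.
            self._offsets[key] = offset
            return
        bisect.insort(self._sorted_keys, key.encode("utf-8"))
        self._offsets[key] = offset
        self._import_order.append(key)

    def remove(self, key):
        del self._offsets[key]
        self._sorted_keys.remove(key.encode("utf-8"))
        self._import_order.remove(key)

    def nearest(self, prefix):
        """Binary search for the first key at or after prefix, or None."""
        pos = bisect.bisect_left(self._sorted_keys, prefix.encode("utf-8"))
        if pos == len(self._sorted_keys):
            return None
        return self._sorted_keys[pos].decode("utf-8")

    def offset_of(self, key):
        return self._offsets.get(key)

    def keys_in_import_order(self):
        return list(self._import_order)
```

The point of the sketch is that every insert and remove has to touch both indexes, which is the bookkeeping that makes retaining import order less than trivial once modules are edited after the initial import.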
> I am not in favor of using lucene for any core functionality as it
> would mandate a requirement, which is not practical on all platforms. We
> can easily implement the same thing with our rawstr index without
> this penalty.
> This is a good item to consider for 1.5.11
> DM Smith wrote:
>> I'm not sure if I am reading the Sword code correctly, but it appears
>> that it is sorting at a byte level and not a character level. That
>> isn't by code points.
>> I think that we discussed this a little bit ago and concluded that
>> some work needs to be done in the engine.
>> Here is my thought on the matter, for what it is worth. Today the sort
>> serves two purposes: order and search. But it is search that
>> constrains the order to be as it is. I think that if we could search
>> independently of the order of keys in the module that would be ideal.
>> One simple way for any application to provide this is to create a
>> Lucene index for the dictionary (similar to what we do for a Bible)
>> that consists of the term (stored
>> and indexed), the offset (stored) in the module (so it can be
>> retrieved and previous and next indexes can be found), the entry for
>> the term (indexed, but not stored). The application can then create
>> any kind of collation of the keys (using the excellent facilities of
>> ICU) that suits the user's needs. Then, using this double handle,
>> present the keys in part (as in BibleCS) or whole (as in
>> BibleDesktop, MacSword, ...) in the order that the user expects.
>> There are some related problems to this:
>> A user may expect to be able to find a Hebrew word in a Hebrew
>> dictionary independent of the pointing of the word in the dictionary.
>> (i.e. a user may wish to search without specifying accents)
>> A user may expect to find a word by stem not just by prefix.
>> A user may expect to be able to type "photos" (a transliteration) and
>> find the real Greek word in a Greek dictionary.
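The first of these related problems, finding a word regardless of its pointing or accents, can be sketched with a folded lookup table. This is an assumption-laden illustration using only Python's unicodedata module. The names fold and build_folded_index are invented for the example, and neither stemming nor transliteration is covered, since both would need language-specific tables.

```python
import unicodedata

def fold(term):
    """Drop combining marks (accents, vowel points) and case differences."""
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def build_folded_index(keys):
    """Map each folded form to the module keys that share it."""
    index = {}
    for key in keys:
        index.setdefault(fold(key), []).append(key)
    return index

# Sample keys written with escapes so the code points are unambiguous:
# a pointed Hebrew word (shin, qamats, shin dot, lamed, vav, holam, final mem)
# and a Greek word carrying a circumflex accent (phi, omega+perispomeni, sigma).
pointed_hebrew = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
accented_greek = "\u03c6\u1ff6\u03c2"
index = build_folded_index([pointed_hebrew, accented_greek])

# Queries typed without points or accents find the real keys.
hebrew_hits = index.get(fold("\u05e9\u05dc\u05d5\u05dd"), [])
greek_hits = index.get(fold("\u03c6\u03c9\u03c2"), [])
```

Because the same fold is applied to both the stored keys and the query, the user can type either the pointed or the unpointed form and reach the same entry.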
>> I'm cross-posting to J-Sword because this will be of interest there
>> as well.
>> In His Service,
>> DM Smith
>> On Oct 28, 2007, at 9:13 PM, Frank wrote:
>>> peter wrote:
>>>> Is this really only a Vietnamese problem, or will not all latinate
>>>> scripts with extra signs have exactly the same problem?
>>>> Or actually all scripts which are treated as derived scripts -
>>>> Urdu and Malay from Arabic, Tajik, Uzbek, Azeri from Russian etc -
>>>> code points are initially for the "main" characters and then there
>>>> is always a bunch of extra characters which are used only in one or
>>>> But maybe I am just showing my ignorance here. I need to look at
>>>> dictionaries - never had any installed.
>>> Any language that uses letters outside the ASCII range will be
>>> affected, unless they collate the letter after "z"... and if it's
>>> strictly in Unicode code point order, then all upper case will collate
>>> before lower case.
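The effect described here is easy to reproduce. The short sketch below sorts a few made-up sample words the way Python compares strings by default, which is by Unicode code point.

```python
# Sample words; \u00e9 is e-acute.
words = ["apple", "Zebra", "zoo", "\u00e9cole", "Banana"]

# Python compares str values code point by code point, so this is a
# plain code point sort with no language-aware collation applied.
code_point_order = sorted(words)

# Upper-case ASCII (A-Z) comes before lower-case ASCII (a-z), and any
# letter outside ASCII comes after "z".
first_lowercase = next(w for w in code_point_order if w[0].islower())
```

In this ordering "Zebra" lands ahead of "apple", and the accented word ends up after "zoo", which is not what a reader of a dictionary would expect.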
>>> sword-devel mailing list: sword-devel at crosswire.org
>>> Instructions to unsubscribe/change your settings at above page