[sword-devel] Dictionary ordering

Greg Hellings greg.hellings at gmail.com
Thu Sep 18 09:14:16 MST 2008


DM,

On Thu, Sep 18, 2008 at 10:53 AM, DM Smith <dmsmith555 at yahoo.com> wrote:
> This thread has pointed out several issues that need to be solved:
> 1) Proper ordering of entries, where proper may be defined by a) the
> publisher (Liddell-Scott) or b) the language (Viet, Farsi) and not by
> strict byte ordering.
> 2) User input, with and without diacritics; with and without
> punctuation; with and without language spelling variations (e.g. ä
> becomes ae in German, æ becomes ae in English).
> 3) Alternate keys (e.g. Α, α and G0001 for Strong's Greek Dictionary)
> 4) Backward compatibility, i.e. don't change the current dictionary
> structure.
> 5) Efficient lookup.
>
> Regarding language ordering (1b) ICU defines a collation sort key that
> given a supported language and a string will produce a key that can be
> used for byte comparison. If an index is created, sorted on these keys,
> then the user input can be converted the same way and used for lookup.
>
> Regarding 2, this is the same issue we have regarding searching Biblical
> text with diacritics.
>
> Regarding 4, backward compatibility can be retained by creating the
> module as it is today, but add one or more additional files that address
> alternate means of lookup.
>
> Regarding 5, using linear search is not the answer. Adding one or more
> additional lookup files will solve that with double indirection: Lookup
> in new index results in entry in old index.
>
> Here is what I'd recommend:
> 1) Use ICU for collation sort keys for the primary sorting, putting this
> into an additional file. This file would otherwise be identical to the
> current idx. Programs that are aware of this file (e.g. 1.5.12+) would
> be able to take advantage of it. Older programs would continue as before.

Would ICU be sufficient for our tasks?  I know nothing about it, but
would it have adequate support for all of the minority languages that
Scripture translations and versions might cover?  And does it support
publisher-specific conventions, such as collecting every "St." entry
separately from the rest of the entries that begin with "S"?  If it
supports both, then it is probably perfect for what we're looking to
do; otherwise, it might restrict us more than we would like.
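
To make the question concrete, here is a minimal Python sketch of the
sort-key index with double indirection that DM describes. It does not
use ICU: toy_sort_key is a hypothetical stand-in built on the standard
unicodedata module, and the "St." rule and the sample words are invented
for illustration. Whether ICU can express such a rule is the open
question.

```python
import bisect
import unicodedata

def toy_sort_key(word):
    """Hypothetical stand-in for an ICU collation key (this is NOT ICU)."""
    # Fold case and drop combining marks so accented forms sort with the base letter.
    decomposed = unicodedata.normalize("NFD", word.casefold())
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Assumed publisher rule: "St." entries come before every other "S" entry.
    if base.startswith("st. "):
        base = "s\x00" + base[4:]
    return base.encode("utf-8")

def build_sort_index(old_index):
    """New index: (sort key, position in the old idx), ordered by sort key."""
    return sorted((toy_sort_key(w), pos) for pos, w in enumerate(old_index))

def lookup(sort_index, user_input):
    """Convert the input the same way, binary-search it, return the old position."""
    key = toy_sort_key(user_input)
    i = bisect.bisect_left(sort_index, (key,))
    if i < len(sort_index) and sort_index[i][0] == key:
        return sort_index[i][1]
    return None

# The existing idx stays in strict byte order, as it is today.
old_index = sorted(["Sabbath", "Salem", "St. Paul", "Stone", "Zion", "Éden"],
                   key=lambda w: w.encode("utf-8"))
sort_index = build_sort_index(old_index)

# Browsing follows the new index; each step resolves to an old idx position.
browse_order = [old_index[pos] for _, pos in sort_index]
```

Here browse_order lists "Éden" first and "St. Paul" ahead of "Sabbath",
while old_index is untouched, so older programs would keep working.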

--Greg

> 2) Use Lucene for search and retrieval, indexing multiple
> representations of each key and storing the idx info in it.
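
A minimal Python sketch of "multiple representations of each key",
assuming a plain in-memory dictionary rather than Lucene. The normalize
function, the sample entries and their positions are hypothetical, and
language-specific spelling variants (such as ä becoming ae in German)
are not handled here.

```python
import unicodedata

def normalize(text):
    """Fold case and drop diacritics and punctuation (an assumed normalization)."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(
        c for c in decomposed
        if not unicodedata.combining(c)
        and not unicodedata.category(c).startswith("P")
    )

def build_alias_table(entries):
    """entries: list of (position in the existing idx, [alternate keys])."""
    table = {}
    for pos, keys in entries:
        for key in keys:
            table[normalize(key)] = pos
    return table

# Illustrative entries only; the positions are made up.
entries = [
    (0, ["G0001", "Α", "α"]),
    (1, ["G0002", "Ἀαρών"]),
    (2, ["'tis"]),
]
aliases = build_alias_table(entries)

def find(text):
    """Return the idx position for any stored representation, or None."""
    return aliases.get(normalize(text))
```

With this table, find("α"), find("Α") and find("g0001") all resolve to
the same entry, and find("tis") finds "'tis" without the apostrophe.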
>
> In Him,
> DM
>
> Ben Morgan wrote:
>> Strong's numbers are padded by 0, which is why they currently sort
>> properly.
>>
>> Such a sort order sounds a good idea (though perhaps not for the
>> developers ;).
>> Modules produced in such a way wouldn't be compatible with Sword <=
>> 1.5.11, though.
>>
>> I still suspect that sorting isn't quite as easy as specifying a
>> simple sort order such as you suggested.
>> Once diacritics enter into it (especially non-composed diacritics),
>> things could get a little more difficult.
>> Perhaps allowing the delimiter separated variables to be longer than 1
>> character long might help.
>>
>> This still won't catch everything, but it would be a good thing to
>> have - I think I've seen English dictionaries which put "St.
>> Something" entries at the start of S...
>>
>> As for the example 'tis, you can't catch everything. This is when you
>> want to do a search on the keys of the dictionary.
>>
>> God Bless,
>> Ben
>> -------------------------------------------------------------------------------------------
>> The Lord is not slow to fulfill his promise as some count slowness,
>> but is patient toward you, not wishing that any should perish,
>> but that all should reach repentance.
>> 2 Peter 3:9 (ESV)
>>
>>
>> 2008/9/18 Greg Hellings <greg.hellings at gmail.com>
>>
>>     On Wed, Sep 17, 2008 at 10:38 PM, Daniel Owens <dhowens at pmbx.net> wrote:
>>     >
>>     >
>>     > Greg Hellings wrote:
>>     >
>>     > On Wed, Sep 17, 2008 at 9:56 PM, Daniel Owens <dhowens at pmbx.net> wrote:
>>     >
>>     >
>>     > Ben,
>>     >
>>     > Thanks for the explanation. Setting up dictionaries to use key
>>     > retrieval from an uncompressed file with one key per line
>>     > (ordered as the module creator orders it) makes the most sense to
>>     > me. If that helps increase efficiency and preserves the order of
>>     > dictionary entries, then that is what we want.
>>     >
>>     >
>>     > Would it also be possible to put a space-delimited (or anything
>>     > else delimited) list of the order that characters ought to be
>>     > arranged in for a given dictionary? Then the module creator could
>>     > put them in whatever order is desired in the import file, and the
>>     > ordering can be based off of the configuration file. Sorting would
>>     > be as simple as replacing the characters in each entry with an
>>     > integer and sorting the resulting vectors. In the absence of a
>>     > sort-field, the module import file's order could be the default
>>     > (or the current behavior, whichever is deemed better)?
>>     >
>>     >
>>     > --Greg
>>     >
>>     >
>>     >
>>     > What if one of the TEI elements were an integer (much like
>>     > Strong's)? The dictionary could be sorted by that integer but
>>     > entries would not display the integer but rather the actual word
>>     > entry.
>>
>>     I would suppose, in that case, the sorting could be left in the
>>     default mode of sorting based on the document's type. Alternatively
>>     two config entries could be devised.
>>     Sorting=None|Default|Config
>>     SortOrder=a b c d e...
>>
>>     The Sorting=None would be sorting left in the order of the import
>>     file, Default would be the current (and default) behavior and Config
>>     would indicate to follow the sorting order listed in SortOrder, which
>>     could be completely arbitrary, based on the module creator's
>>     preferences. For a language or a listing which used numerals, like
>>     Strongs, they would not be perturbed by either the original scheme or
>>     this expanded suggestion. Since the characters of the Strongs entries
>>     are distinct from integers, and the mapping would take characters into
>>     integers for the sorting process, then back into their original
>>     characters, no violence would be done to the Strongs numbers
>>     themselves. If there was a mixture of letters and numbers, it still
>>     wouldn't be a problem, and the module creator could include the
>>     integers wherever they wanted in the SortOrder listing.
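
A minimal Python sketch of the Sorting=Config case, assuming SortOrder
is a space-delimited list as proposed above. The make_sort_key helper is
hypothetical, and the rule that unlisted characters sort last by code
point is an assumption, not something the thread settled. Symbols longer
than one character are allowed, along the lines Ben suggests.

```python
def make_sort_key(sort_order):
    """Build a key function from a space-delimited SortOrder string."""
    symbols = sort_order.split()
    rank = {sym: i for i, sym in enumerate(symbols)}
    # Try longer symbols first so multi-character units win over single letters.
    by_length = sorted(symbols, key=len, reverse=True)

    def key(entry):
        ranks = []
        i = 0
        while i < len(entry):
            for sym in by_length:
                if entry.startswith(sym, i):
                    ranks.append(rank[sym])
                    i += len(sym)
                    break
            else:
                # Assumption: characters missing from SortOrder sort after
                # every listed symbol, in code point order.
                ranks.append(len(symbols) + ord(entry[i]))
                i += 1
        return ranks

    return key

# Illustrative, partial alphabet; a real module would list every character.
sort_order = "a b c d đ e g z"
entries = ["za", "đa", "da", "ea"]

configured = sorted(entries, key=make_sort_key(sort_order))
code_point = sorted(entries)
```

Each entry becomes a vector of integers, the vectors are sorted, and the
entries themselves are never altered, so keys such as Strong's numbers
are left as they are.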
>>
>>     --Greg
>>
>>     >
>>     > Daniel
>>     >
>>     > I will agree that the sorted order is not as important in BPBible
>>     > because of the lookup feature, but that breaks down when you need
>>     > to browse further within a range of entries. Furthermore, the
>>     > example of "'tis" suggests that, even in English, code point
>>     > ordering disturbs the natural order of the dictionary, making it
>>     > harder to browse for the right entry. Unless you type in the
>>     > apostrophe, you won't find "'tis" because it will not be near "t"
>>     > but at the top of the dictionary, which is very far away. In
>>     > BibleDesktop, which doesn't have the lookup feature yet, you have
>>     > to browse for any dictionary entry (except Strong's, where the key
>>     > is a number and therefore in printed order!), so the ordering
>>     > really does matter. Also, frankly, a dictionary out of
>>     > alphabetical order just looks silly. In Vietnamese it's chaotic
>>     > when dictionaries are ordered by code point. Who ever heard of a
>>     > dictionary where "d" comes after "z"? That's what happens in
>>     > Vietnamese.
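
The effect is easy to reproduce. A minimal Python sketch, assuming a few
illustrative sample words (not taken from any module): Python's default
string sort compares code points, so it shows the same misplacement.

```python
# Illustrative keys: two start with ordinary letters, one with đ (U+0111),
# and one with an apostrophe (U+0027).
words = ["tin", "xe", "đá", "'tis", "da"]

# Default string comparison is by code point, like strict byte ordering
# of UTF-8 keys.
by_code_point = sorted(words)

# First-character code points explain the result.
first_code_points = [(w, hex(ord(w[0]))) for w in by_code_point]
```

The apostrophe (0x27) sorts before every letter, and đ (0x111) sorts
after "x" and "z", so "'tis" lands at the top and "đá" at the bottom.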
>>     >
>>     > Daniel
>>     >
>>     > Ben Morgan wrote:
>>     >
>>     > Hi Daniel,
>>     >
>>     > Code points are not the only way to sort it.
>>     > However, there does need to be a comparison function defined,
>>     > which will compare two words and give which is bigger.
>>     > This needs to be used consistently, from module creation to
>>     > frontend. There could be a library of defined comparators provided
>>     > by SWORD - but you would need one for each sort order you wanted
>>     > (which approaches one per language).
>>     >
>>     > Personally, I don't find that sorted order is particularly
>>     > important in dictionaries - I would type in a word, and then hope
>>     > that if it is a different form of the word it would be relatively
>>     > close. Some frontends may not give the ability to type in words,
>>     > though.
>>     >
>>     > But I haven't used dictionaries in other languages, so it may be
>>     > different for them - especially once diacritics are involved.
>>     >
>>     > The reasons why dictionaries are different from bibles are:
>>     > 1) Bibles have a known structure, which is hardcoded in the key
>>     > type (this is going to be able to change soon, for alternate
>>     > versification, though - probably leading to less efficient modules)
>>     > 2) Dictionaries can be much, much larger - Webster's is a 14 MB
>>     > download compressed, as compared to the WEB's ~1.5 MB
>>     >
>>     > That's not to say the dictionaries can't be done more efficiently
>>     > than they are currently. Looking at the code, they could be
>>     > quicker for the (common?) case of incrementing a module. Currently
>>     > they do a binary search for every increment.
>>     > Further, they could probably be optimized for key retrieval -
>>     > which is the really important thing here. (For example by storing
>>     > the keys separately, uncompressed, 1 key per line)
>>     >
>>     > God Bless,
>>     > Ben
>>     >
>>     > -------------------------------------------------------------------------------------------
>>     > The Lord is not slow to fulfill his promise as some count slowness,
>>     > but is patient toward you, not wishing that any should perish,
>>     > but that all should reach repentance.
>>     > 2 Peter 3:9 (ESV)
>>     >
>>     >
>>     > On Thu, Sep 18, 2008 at 11:21 AM, Daniel Owens <dhowens at pmbx.net> wrote:
>>     >
>>     >
>>     >
>>     > Is code point order the ONLY way to sort dictionary entries?
>>     > Surely there is a solution which would retain the printed or
>>     > intended order of dictionary entries without giving up lots of
>>     > efficiency. If not, I think users would prefer a correctly
>>     > ordered but slower dictionary to one which is fast but jumbled
>>     > up.
>>     >
>>     > At the very least, even if dictionaries aren't sorted by the
>>     > printed order, they should AT LEAST be in alphabetical order. To
>>     > me that is a non-negotiable for a dictionary--people depend on
>>     > dictionaries being in the right order, and code point order
>>     > disturbs that for some languages. Here are a couple of ideas:
>>     > - Could a configuration file of some sort be created to define a
>>     > sorted order for a given language that would actually be in
>>     > alphabetical order?
>>     > - Could a dictionary index be created to handle large dictionaries
>>     > which allows for the retention of the correct order of entries
>>     > (whether that is the printed order or alphabetical order)?
>>     > - Bibles are not ordered by code point, and we are able to search
>>     > them fairly quickly. Do dictionaries need to be compiled in a
>>     > fashion similar to Bibles?
>>     >
>>     > As it stands, dictionaries are NOT displayed in alphabetical order
>>     > (at least not Vietnamese, and apparently Farsi), which at best
>>     > looks silly to the user and at worst means you have to manually
>>     > hunt around to find the right entry, making a Genbook more
>>     > efficient for the user in the end. But then you lose the
>>     > dictionary lookup feature.
>>     >
>>     > Daniel
>>     >
>>     > Ben Morgan wrote:
>>     >
>>     > The issue with ordering as I understand it is that if it is in
>>     > (some form of) sorted order, you can use binary search to find
>>     > entries.
>>     > If you want order retained, it is best to use a genbook - but it
>>     > won't be as efficient, and may not have as good UI support.
>>     > With huge English dictionaries (like Webster's, for instance) this
>>     > becomes very important.
>>     >
>>     > From BPBible's perspective, dictionary handling is done as follows:
>>     > 1. Read the index of the dictionary and divide by 4 or 6 to get
>>     > the length (depending on the driver)
>>     > 2. Set the virtual list length to the dictionary length
>>     > 3. When any item is displayed in the virtual list, it retrieves it
>>     > from the module.
>>     > 4. When the user starts typing in the text box above, it does a
>>     > binary search to find which item to display.
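
The four steps above can be sketched in Python as follows. This is a
minimal illustration, not BPBible's code: VirtualKeyList, read_key and
the sample keys are hypothetical, and the record size of 6 is just one
of the two values mentioned in the list.

```python
import bisect

class VirtualKeyList:
    """Lazily exposes dictionary keys; nothing is read until an item is shown."""

    def __init__(self, index_size_bytes, record_size, read_key):
        # Steps 1 and 2: the entry count is the index size divided by the
        # per-entry record size (which depends on the driver).
        self.length = index_size_bytes // record_size
        self.read_key = read_key  # callable: position -> key, reads from the module

    def __len__(self):
        return self.length

    def __getitem__(self, pos):
        # Step 3: a key is fetched only when the list asks for it.
        if not 0 <= pos < self.length:
            raise IndexError(pos)
        return self.read_key(pos)

def find_position(keys, typed):
    """Step 4: binary search for the first key not less than the typed text."""
    return bisect.bisect_left(keys, typed)

# Stand-in for a module whose keys are stored in sorted order.
stored = ["abide", "ark", "bread", "covenant", "zeal"]
reads = []

def read_key(pos):
    reads.append(pos)  # record each simulated module read
    return stored[pos]

keys = VirtualKeyList(index_size_bytes=len(stored) * 6, record_size=6,
                      read_key=read_key)
position = find_position(keys, "cov")
```

The search touches only a handful of keys, which is why it depends on
the keys being stored in the same order the comparison uses.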
>>     >
>>     > Step 4 is already quite slow on big dictionaries - if the
>>     > dictionary were unsorted, it would be quite a lot slower, I
>>     > imagine. All the keys from the module would have to be read in,
>>     > which takes a while.
>>     >
>>     > God Bless,
>>     > Ben
>>     >
>>     > -------------------------------------------------------------------------------------------
>>     > The Lord is not slow to fulfill his promise as some count slowness,
>>     > but is patient toward you, not wishing that any should perish,
>>     > but that all should reach repentance.
>>     > 2 Peter 3:9 (ESV)
>>     >
>>     >
>>     > On Thu, Sep 18, 2008 at 12:43 AM, Daniel Owens <dhowens at pmbx.net> wrote:
>>     >
>>     >
>>     >
>>     > mention that byte ordering does some strange things to Vietnamese
>>     > dictionaries. The Vietnamese script is a Latin script, but because
>>     > it uses some odd characters, code point ordering results in
>>     > illogical and non-alphabetical entry ordering. For example, the
>>     > "d" with a line through it (đ) gets relegated to near the end of
>>     > the dictionary instead of after the regular "d", and anything with
>>     > an apostrophe at the beginning of a word or phrase gets moved to
>>     > the top of the list regardless of the first letter (such as
>>     > 'tis). I am supportive of the IIRC general opinion. Let the module
>>     > creator worry about the ordering. Otherwise you get some very
>>     > strange dictionary behavior.
>>     >
>>     >
>>     >
>>     > ------------------------------
>>     >
>>     > _______________________________________________
>>     > sword-devel mailing list: sword-devel at crosswire.org
>>     > http://www.crosswire.org/mailman/listinfo/sword-devel
>>     > Instructions to unsubscribe/change your settings at above page
>>     >
>>     >
>>     > --
>>     > PMBX license 1502

