[sword-devel] Improvements in dictionary collation. was Re: AbbottSmith module question

Fri Jan 15 09:39:40 MST 2016

Hi,

I've been working on a rudimentary Greek lexicon covering both the New
Testament and the Septuagint, and in the process I ran into this
issue.  After all the discussion and work on the Hebrew lexicon, I
reevaluated my approach.  What I finally decided on was using
unaccented, lowercase forms of the lemmas as the keys of the lexicon.
These keys then sort properly without any extra work.  In the relatively
few cases of duplication, I append a .1, .2, etc.; this affects only a
small percentage of the entries.  So the keys are derived from the
lemmas, but the lemmas themselves keep their full form for display when
listing the lexicon.
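
For anyone who wants to try this, here is a minimal sketch of the key
scheme in Python. It is not the actual code used for the lexicon. The
names make_key and build_index are made up for illustration, and it
assumes that the first lemma with a given key keeps the bare key while
later duplicates get .1, .2, and so on.

```python
import unicodedata


def make_key(lemma):
    """Return an unaccented, lowercase form of a Greek lemma."""
    # NFD splits each letter from its accents, breathings and iota subscript.
    decomposed = unicodedata.normalize("NFD", lemma)
    # Drop every combining mark, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped.lower())


def build_index(lemmas):
    """Map unique keys to display lemmas (assumed suffix rule: .1, .2, ...)."""
    seen = {}   # base key -> how many lemmas have produced it so far
    index = {}  # unique key -> lemma as it should be displayed
    for lemma in lemmas:
        base = make_key(lemma)
        n = seen.get(base, 0)
        seen[base] = n + 1
        key = base if n == 0 else f"{base}.{n}"
        index[key] = lemma
    return index


index = build_index(["λόγος", "εἰ", "εἶ"])
for key in sorted(index):
    print(key, "->", index[key])
```

The keys sort by plain code point order, which for unaccented lowercase
Greek matches alphabetical order, and the display form stays available
as the value.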

Hope this helps.

Peace,

David

On 1/15/2016 9:36 AM, DM Smith wrote:
>> On Jan 15, 2016, at 8:01 AM, Jonathan Morgan <jonmmorgan at gmail.com> wrote:
>>
>> Hi DM,
>>
>> On Fri, Jan 15, 2016 at 1:40 AM, DM Smith <dmsmith at crosswire.org> wrote:
>>
>>     I’ve been trawling through the code. Seems that there is support
>>     for Strong’s Numbers that are not padded. If a module contains
>>     Strong’s Numbers that are not padded, it is to use
>>     StrongsPadding=false. (Actually any value other than “true” will
>>     be false. TRUE is false.) This module does not have it.
>>
>>     Not having StrongsPadding in a conf is the same as
>>     StrongsPadding=true. There’s a note in the wiki that says that
>>     we’ll probably reverse that in the future. I doubt it. We still
>>     have LZSS as the default compression though no module has used it
>>     for years (other than experimental modules).
>>
>>     I’m not sure how a Bible with a reference to G0001 will find G1
>>     as it doesn’t unpad the user’s input. But at least the dictionary
>>     should work. BTW, there’s a missing “if (strongsPadding)” in
>>     rawLD. It is present in zLD. I think this is a bug. Need to
>>     verify, report and submit a patch for it. (BTW, I don’t have
>>     write permissions either on the main repo, but I’m not
>>     discouraged in contributing and submitting patches.)
>>
>>
>> Sorry if I'm missing something, but surely keys without padding 
>> wouldn't appear in the correct (numeric) order in the dictionary?
>>
>> Jon
>
> Jon,
>
> Right. They will be in collation order, not numerical order. The
> module doesn’t work as a SWORD module for that reason, which was my
> primary motivation for moving it to the Experimental repository. The
> tei2mod program needs to add support for Strong’s numbers as imp2ld
> has; tei2mod doesn’t pad the values as it puts them into the module.
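
To make the ordering problem concrete, here is a small Python sketch. It
is not SWORD code. The helper pad_strongs is hypothetical, and the width
of 4 digits after the G/H prefix is an assumption taken from the G0001
example earlier in the thread.

```python
import re


def pad_strongs(key, width=4):
    """Zero-pad the numeric part of a key such as 'G1' (hypothetical helper)."""
    m = re.fullmatch(r"([GHgh]?)([0-9]+)", key)
    if m is None:
        return key  # not a Strong's-style key, leave it untouched
    prefix, digits = m.groups()
    return prefix.upper() + digits.zfill(width)


keys = ["G1", "G10", "G2", "G100"]

# Plain string comparison: G10 and G100 sort before G2.
print(sorted(keys))                          # ['G1', 'G10', 'G100', 'G2']

# After padding, string order and numeric order agree.
print(sorted(pad_strongs(k) for k in keys))  # ['G0001', 'G0002', 'G0010', 'G0100']
```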
>
> The ordering problem is a more general problem. Our collation order is
> good for ASCII. It is not good for Latin-1, as the byte values of
> accented letters are not adjacent to those of their unaccented
> counterparts.
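
A quick Python illustration of the Latin-1 point, using made-up sample
words. The accent-stripping sort key is only a crude stand-in chosen for
this sketch; it is not real language-aware collation.

```python
import unicodedata

words = ["zebra", "eclair", "école", "etude"]

# Byte order on the Latin-1 encoding: 'é' is 0xE9, which is above 'z' (0x7A).
by_bytes = sorted(words, key=lambda w: w.encode("latin-1"))
print(by_bytes)  # ['eclair', 'etude', 'zebra', 'école']


def strip_accents(word):
    """Remove combining marks so 'é' compares like 'e' (rough approximation)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Sorting on the unaccented form puts 'école' next to the other e-words.
folded = sorted(words, key=lambda w: (strip_accents(w), w))
print(folded)    # ['eclair', 'école', 'etude', 'zebra']
```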
>
> Each language/script combination has its own collation order. Some
> languages use multiple glyphs for a single letter. This was noted
> earlier this month on this mailing list.
>
> In a past job, I had to implement a sort routine that would account 
> for numbers occurring anywhere in a string. What I discovered in the 
> process of doing this was that there is a need for an internal 
> representation that differs from an external representation and 
> routines that would normalize an external representation to an 
> internal representation. Basically that routine would look at a string 
> as an alternating sequence of numbers and non-numbers. The routine 
> external2internal would create a string where numbers were zero padded 
> to 10 digits. (It also did other things like strip noise words from 
> the string, normalize dotted acronyms, normalize casing, …).
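
The routine described above might look roughly like this in Python. This
is a reconstruction from the description, not the original code. It only
covers the number padding and case normalization; noise words and dotted
acronyms are left out, and a run of more than 10 digits is passed through
unchanged.

```python
import re


def external2internal(text, width=10):
    """Build a sort key by zero-padding every run of digits in the string."""
    # Split into alternating runs of digits and non-digits.
    parts = re.findall(r"[0-9]+|[^0-9]+", text)
    out = []
    for part in parts:
        if part[0] in "0123456789":
            out.append(part.zfill(width))  # number: pad on the left
        else:
            out.append(part.casefold())    # text: normalize casing
    return "".join(out)


titles = ["Psalm 10", "Psalm 2", "psalm 1"]
print(sorted(titles))                         # ['Psalm 10', 'Psalm 2', 'psalm 1']
print(sorted(titles, key=external2internal))  # ['psalm 1', 'Psalm 2', 'Psalm 10']
```

The external form is never changed; the internal form exists only as the
sort key.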
>
> Also in an earlier posting this month, I mentioned that ICU has 
> collation routines that are language and script sensitive. The 
> collation values that these produce are good for byte-order sorting, 
> but are not intended for external use.
>
> What we need is a dictionary that stores the case-insensitive keys and 
> that the frontend can collate as it sees fit. That collation order 
> would be used to sort and show the case-insensitive keys. Basically 
> another layer of indirection with a mapping from external presentation 
> to the internal storage of the module.
>
> We’ve talked about this before. I think Troy suggested a mechanism.
>
> I’m going to survey the lexdict modules in all the repos in the Master 
> list (and a few others) to see where we stand with those modules and 
> the StrongsPadding flag. If any key starts with a number and isn’t 
> zero padded, it will have difficulty if StrongsPadding=false is not in 
> the conf. If a module has some that are zero padded and others that 
> are not, this also is a problem.
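
A survey like that could start from a check along these lines. This is
only a sketch: it works on a plain list of key strings rather than on
real modules, the name survey_keys is made up, and the padded width of 5
digits for bare numeric keys is an assumption.

```python
import re


def survey_keys(keys, width=5):
    """Classify a module's keys by whether leading numbers are zero padded."""
    padded = unpadded = 0
    for key in keys:
        m = re.match(r"[0-9]+", key)
        if m is None:
            continue  # key does not start with a number
        if len(m.group()) >= width:
            padded += 1
        else:
            unpadded += 1
    if padded and unpadded:
        return "mixed"             # a problem whatever the conf says
    if unpadded:
        return "unpadded"          # conf would need StrongsPadding=false
    if padded:
        return "padded"
    return "no numeric keys"


print(survey_keys(["00001", "00002"]))  # padded
print(survey_keys(["1", "25"]))         # unpadded
print(survey_keys(["00001", "2"]))      # mixed
```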
>
> DM
