[sword-devel] demo TEI modules

Wed Sep 19 15:30:28 MST 2007

On Sep 19, 2007, at 5:49 PM, Chris Little wrote:

>
>
> Troy A. Griffitts wrote:
>> We probably need to do a few things here besides toupper (to assure
>> entry matches), as we've learned and done in our search code.  We
>> probably should at least normalize the utf8.  This is not a big hit
>> because it is only done on module creation for every key, and then
>> once for the input word before the binary search starts.
>
> I wish we could display keys in non-touppered form. Capitals are so
> ugly, especially outside of basic modern western European languages.

I would like to see that too.

>
>> We could change the actual order to use a utf8 strcmp method, but this
>> would likely come with a relatively significant performance hit (though
>> maybe not-- the binary search algorithm will significantly limit the
>> number of actual utf8 strcmp operations we would need to perform).  This
>> change would require remaking any modules which use multibyte utf8 keys.
>
> Collation is tricky. For one, it is always language-dependent. We have
> all the necessary data (at least for modern languages) in ICU, but using
> that means requiring ICU, which I'm quite fine with for desktop/server
> frontends, but isn't as practical for handhelds.
>
> Independent of basic, language-wide collation standards, some dictionary
> editors pick different sort orders. The only way to cater to that is to
> store the records in their own module-specific order (e.g. using a
> GenBook-based system for the whole thing or somehow throwing away the
> binary search system). Given that most front-ends are listing the
> complete contents of the LD modules, which negates the utility of the
> binary searches, it might not be a bad idea to scrap the current system
> and make key-entry operate as a pattern-matching search (maybe regex?).

I think this can be fairly easily accomplished.

For one project I worked on, I had to collate a list of titles
according to the collation rules of 5 different languages. To solve
this, we normalized the user's search request in one of the 5 ways,
and we also normalized the titles in each of the 5 ways so that binary
lookup would work. The original index was a strict byte collation
(just like we have in Sword) and we retained that. We essentially
built 5 parallel indexes whose values were pointers to each entry's
location in the original file. This allowed binary search on the
parallel indexes. We never showed the normalized form to the users.
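
To make the parallel-index idea concrete, here is a rough sketch of one
such index in C++. It is not the code from that project. The names
IndexEntry, normalizeKey, buildParallelIndex and lookup are made up for
illustration, and normalizeKey only does ASCII case folding as a
stand-in for a real language-specific normalization. For 5 languages
you would build 5 of these, each with its own normalizer.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One entry of a parallel index: the normalized key plus the position of
// the record it points to in the original, byte-ordered index.
struct IndexEntry {
    std::string normKey;
    std::size_t recordPos;
};

// Stand-in normalizer (assumption): ASCII case folding only. A real build
// would produce a language-specific collation key here instead.
std::string normalizeKey(const std::string &key) {
    std::string out(key);
    for (char &c : out)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return out;
}

// Build the parallel index: normalize every original key once, remember
// where it came from, then sort by the normalized form.
std::vector<IndexEntry> buildParallelIndex(const std::vector<std::string> &keys) {
    std::vector<IndexEntry> idx;
    idx.reserve(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i)
        idx.push_back({normalizeKey(keys[i]), i});
    std::stable_sort(idx.begin(), idx.end(),
                     [](const IndexEntry &a, const IndexEntry &b) {
                         return a.normKey < b.normKey;
                     });
    return idx;
}

// Normalize the query once, then binary search the parallel index.
// On success, pos receives the record's position in the original index.
bool lookup(const std::vector<IndexEntry> &idx, const std::string &query,
            std::size_t &pos) {
    const std::string q = normalizeKey(query);
    auto it = std::lower_bound(idx.begin(), idx.end(), q,
                               [](const IndexEntry &e, const std::string &k) {
                                   return e.normKey < k;
                               });
    if (it == idx.end() || it->normKey != q)
        return false;
    pos = it->recordPos;
    return true;
}
```

The original keys stay in their byte order and are the only thing ever
displayed; the normalized strings live only inside the parallel index.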

Now with regard to Sword, I think that we don't need more than one
collation. I think that the module author should be able to control
the choice. So normalization can be any single algorithm we want. I'd
suggest using ICU to build the normalized key. If ICU is present, the
lookup will iterate over the new index; otherwise, it will fall back to
the old method. If the new method is used, then mixed case can be shown.
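
A rough sketch of that fallback, again in C++ and again hypothetical:
LexModule, collationKey, asciiUpper and findEntry are invented names,
not existing Sword API. collationKey is a placeholder (ASCII lowercase)
for where an ICU-built key would go, and the collatorAvailable flag
stands in for "ICU is present".

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory view of a lexicon/dictionary module.
struct LexModule {
    std::vector<std::string> upperKeys;   // old index: uppercased keys, byte order
    std::vector<std::string> displayKeys; // mixed-case keys, parallel to upperKeys
    // Optional new index: (normalized key, position in upperKeys), sorted by
    // normalized key. Empty when the module was built without it.
    std::vector<std::pair<std::string, std::size_t>> collIndex;
};

// Old method stand-in: ASCII-only uppercase.
std::string asciiUpper(std::string s) {
    for (char &c : s)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return s;
}

// Placeholder (assumption) for a key built with ICU; here just ASCII lowercase.
std::string collationKey(std::string s) {
    for (char &c : s)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

// Returns the key to show for a matching entry, or "" when nothing matches.
std::string findEntry(const LexModule &m, const std::string &query,
                      bool collatorAvailable) {
    if (collatorAvailable && !m.collIndex.empty()) {
        // New method: binary search the normalized index, show mixed case.
        const std::string k = collationKey(query);
        auto it = std::lower_bound(
            m.collIndex.begin(), m.collIndex.end(), k,
            [](const std::pair<std::string, std::size_t> &e,
               const std::string &key) { return e.first < key; });
        if (it != m.collIndex.end() && it->first == k)
            return m.displayKeys[it->second];
        return "";
    }
    // Old method: uppercase the query and binary search the byte-ordered keys.
    // Only the uppercased form is available to show here.
    const std::string k = asciiUpper(query);
    auto it = std::lower_bound(m.upperKeys.begin(), m.upperKeys.end(), k);
    if (it != m.upperKeys.end() && *it == k)
        return *it;
    return "";
}
```

With a module holding "agape" and "Logos", a lookup of "LOGOS" through
the new index would return "Logos", while the fallback path would
return "LOGOS".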