[sword-devel] better UTF-sensitive sort

Aaron Christianson ninjaaron at gmail.com
Wed Jan 13 21:55:18 MST 2016


Just a heads up that simply using Unicode or locale-based sorting for
Hebrew with vowels and accents does not provide the correct order!
Pointed Hebrew is supposed to be sorted as if the various diacritics
weren't there (except for the dots that distinguish sin from shin), and
then the vowels are used as a secondary criterion (their relative order
varies from source to source).
I've been in correspondence with the Academy of the Hebrew Language
in Israel about this very topic.
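
To make the rule above concrete, here is a rough Python sketch of a
two-level sort key. It is only an illustration of the rule as stated,
not the script mentioned later in this message. Several things are
assumptions: the name hebrew_sort_key is made up, "vowels" is taken to
mean U+05B0..U+05BB plus U+05C7, dagesh and accents are ignored
entirely, vowels are ranked by code point (sources differ on this),
shin ends up before sin only because of code point order, and final
letter forms are not folded onto their regular forms.

```python
import unicodedata

SHIN_DOT = "\u05C1"
SIN_DOT = "\u05C2"
# Assumption: "vowels" = the points U+05B0..U+05BB plus qamats qatan (U+05C7).
VOWELS = {chr(cp) for cp in range(0x05B0, 0x05BC)} | {"\u05C7"}


def hebrew_sort_key(word):
    """Return (primary, secondary) for sorting pointed Hebrew (hypothetical helper).

    primary:   base letters, keeping only the shin/sin dots
    secondary: the vowel points, in the order they occur
    """
    primary = []
    secondary = []
    # NFD also splits precomposed presentation forms into letter + mark.
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.category(ch) != "Mn":
            primary.append(ch)        # base letter or other non-mark character
        elif ch in (SHIN_DOT, SIN_DOT):
            primary.append(ch)        # the one diacritic that counts as primary
        elif ch in VOWELS:
            secondary.append(ch)      # tie-breaker only
        # anything else (accents, dagesh, meteg, rafe) is ignored
    return ("".join(primary), "".join(secondary))


davar = "\u05D3\u05B8\u05D1\u05B8\u05E8"  # dalet+qamats, bet+qamats, resh
degel = "\u05D3\u05B6\u05D2\u05B6\u05DC"  # dalet+segol, gimel+segol, lamed

# Sorted by consonants first, so davar (d-b-r) precedes degel (d-g-l).
print(sorted([degel, davar], key=hebrew_sort_key))
```

Returning a tuple lets Python's own tuple comparison do the "primary,
then secondary" part, so the vowel order can be changed without
touching the consonant handling.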

The problem with the Hebrew vowels is that almost all of them are
represented as Unicode combining characters (which have their own code
points) rather than as a unique precomposed code point for every
possible letter-plus-vowel combination (there would be too many
anyway). Precomposed characters would be easier to handle in
locale-based collation rules.
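
A small Python sketch of what that means in practice, using only the
standard unicodedata module. The two sample words and the helper name
describe are my own choices for illustration. It lists the code points
of a pointed word and shows how a plain code point sort is thrown off
by the vowel on the first letter.

```python
import unicodedata

davar = "\u05D3\u05B8\u05D1\u05B8\u05E8"  # dalet+qamats, bet+qamats, resh
degel = "\u05D3\u05B6\u05D2\u05B6\u05DC"  # dalet+segol, gimel+segol, lamed


def describe(word):
    """List (code point, Unicode name, combining class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in word
    ]


# Three letters on the page, but five code points in the string:
# each vowel is a separate combining character after its letter.
for row in describe(davar):
    print(row)

# A plain sort compares code point by code point, so it reaches the vowel
# on the dalet (segol U+05B6 < qamats U+05B8) before it ever looks at the
# second consonant. degel lands first even though bet precedes gimel.
naive = sorted([davar, degel])
print(naive == [degel, davar])
```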

I've written a script that properly sorts pointed Hebrew for the
glossary of the Hebrew grammar I'm working on, and I'd be happy to
share it, but I'm not sure how practical it is to have a unique sort
method for one problem language. (On the other hand, perhaps it is
worth it, since it fixes a problem for two of the three languages the
Bible was actually written in.)

On Wed, Jan 13, 2016 at 2:37 PM, Karl Kleinpaste <karl at kleinpaste.org> wrote:
> On 01/12/2016 11:32 AM, DM Smith wrote:
>
> Is ICU4C out of the question?
>
> Thanx for the pointer.  It took a bit more contemplation than it probably
> should have, but I used ucol_strcollUTF8() (in icu-i18n) and it seems fine.


