FW: [osis-core] character counting issue: proposed solution

Steve DeRose osis-core@bibletechnologieswg.org
Wed, 19 Jun 2002 21:15:51 -0400


At 10:57 AM -0400 06/18/02, Harry Plantinga wrote:
>Yesterday I posted a problem with counting characters in
>Unicode: some accented characters can be encoded in
>different ways that have different numbers of characters.
>
>Here is a proposed solution.
>
>1.  The _official_ character count is in a normalized version
>of the text, which uses minimal-length encodings of all
>characters.
>
>2.  For a given grain, e.g. @char:52(Hello world!), if the
>52nd character isn't the start of the string "Hello world!",
>point to the first occurrence of "Hello world!" after the 52nd
>character.
>
>3.  Recommend that people not count accents and other modifiers
>when counting characters. This may underestimate the number of
>Unicode characters slightly if there are some accented combinations
>that don't have a single-character representation, but in
>conjunction with (2) above, it will normally give the right result,
>especially if the string is unique.
>
>4.  For people who don't like counting characters and can identify
>unique strings, allow @char:0(Hello world!).  Actually, this
>is implied by 3 above.
>
>5.  (Extra credit). Allow @(Hello world!) as a shortcut for
>@char:0(Hello world!).
>
>-Harry

Kind of nice. The ligature you cited later remains a pain, though.
I'm not sure bagging the offset helps much since, as you pointed out,
the string matching still has to assume the same encoding.
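
To make that concrete, here is a rough sketch of the lookup in Harry's
item 2, assuming the offset is a 0-based index into code points and that
offset 0 means "search from the start". The name resolve_grain is made up
for illustration and is not part of any OSIS spec. It shows that the
search string is only found when it is encoded the same way as the text.

```python
def resolve_grain(text, offset, needle):
    # Assumption: offset is a 0-based code point index; 0 searches the
    # whole text. Returns the index of the first match at or after
    # offset, or -1 if the string does not occur there.
    return text.find(needle, offset)

composed = "caf\u00e9 au lait"      # e-acute as one code point, U+00E9
decomposed = "cafe\u0301 au lait"   # "e" plus combining acute, U+0301

# The same visible text has two different code point counts.
lengths = (len(composed), len(decomposed))

# The search string uses the composed form, so it matches one text only.
hit = resolve_grain(composed, 0, "caf\u00e9")
miss = resolve_grain(decomposed, 0, "caf\u00e9")
```

Dropping the offset removes the counting disagreement but not the
matching one: the decomposed text still gives no match.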

I see two other possible solutions:

1) Change from 'character' to 'code point' ('cp:') and say it's
defined to be stupid and just count and compare Unicode code points,
which are well-defined. This wouldn't work across systems that insist
on changing the representation of data they import, but I would hope
that isn't too bad a problem.

2) Insist on Unicode Normalization Form C. It is a pain to implement,
although there are probably utilities and source around to do it.
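
A small sketch of how the two options differ, using Python's standard
unicodedata module as one example of an existing utility. The helper
names cp_count, cp_equal, and to_form_c are invented here for
illustration only.

```python
import unicodedata

def cp_count(s):
    # Option 1: a Python str is a sequence of code points, so the
    # count is just its length, with no interpretation of accents.
    return len(s)

def cp_equal(a, b):
    # Option 1: compare code point by code point, nothing smarter.
    return a == b

def to_form_c(s):
    # Option 2: convert to Normalization Form C before counting
    # or comparing.
    return unicodedata.normalize("NFC", s)

composed = "\u00e9"       # e-acute as one code point
decomposed = "e\u0301"    # "e" plus combining acute
ligature = "\ufb01"       # the "fi" ligature as one code point

counts = (cp_count(composed), cp_count(decomposed))
raw_match = cp_equal(composed, decomposed)
nfc_match = cp_equal(to_form_c(composed), to_form_c(decomposed))

# Form C leaves the ligature alone, so it still will not match "fi".
ligature_match = cp_equal(to_form_c(ligature), "fi")
```

Under option 1 the two spellings of the accented letter count and
compare as different; under option 2 they agree. Neither one makes the
ligature match the two plain letters.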

I think by Occam's Razor I'd go for #1.

-- 

Steve DeRose -- http://www.stg.brown.edu/~sjd
Chair, Bible Technologies Group -- http://www.bibletechnologies.net
Email: sderose@speakeasy.net
Backup email: sderose@mac.com, sjd@stg.brown.edu