FW: [osis-core] character counting issue: proposed solution

Harry Plantinga osis-core@bibletechnologieswg.org
Tue, 18 Jun 2002 10:57:52 -0400


Yesterday I posted a problem with counting characters in
Unicode: some accented characters can be encoded in more than
one way, and the encodings contain different numbers of characters.
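
To make the problem concrete, here is a small Python sketch. It
is only an illustration using the standard unicodedata module,
and the sample word is my own choice. The same visible text
yields two different character counts:

```python
import unicodedata

# "é" stored as one precomposed code point (U+00E9)
composed = "caf\u00e9"
# "é" stored as "e" followed by a combining acute accent (U+0301)
decomposed = "cafe\u0301"

print(len(composed))    # 4
print(len(decomposed))  # 5

# Composing normalization (NFC) maps the second form onto the first.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```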

Here is a proposed solution.

1.  The _official_ character count is taken on a normalized
version of the text, one that uses the shortest available
encoding of every character.

2.  For a given grain, e.g. @char:52(Hello world!), if the
52nd character isn't the start of the string "Hello world!",
point to the first occurrence of "Hello world!" after the 52nd
character.

3.  Recommend not counting accents and other modifiers when
counting characters. This may slightly underestimate the number
of Unicode characters when an accented combination has no
single-character representation, but in conjunction with (2)
above it will normally give the right result, especially if the
string is unique.

4.  For people who don't like counting characters and can identify
unique strings, allow @char:0(Hello world!).  Actually, this
is implied by 3 above. 

5.  (Extra credit). Allow @(Hello world!) as a shortcut for
@char:0(Hello world!).
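
The points above could be sketched roughly as follows in Python.
This is an illustration under stated assumptions, not a
definitive implementation: the names GRAIN, official_text and
resolve are hypothetical, NFC is assumed to stand in for the
"minimal-length" normalization of point 1, character counts are
assumed to be 1-based, and "accents and other modifiers" is
approximated by characters with a nonzero combining class.

```python
import re
import unicodedata

# Hypothetical grain syntax: "@char:N(text)" or the shortcut "@(text)".
GRAIN = re.compile(r"@(?:char:(\d+))?\((.*)\)", re.DOTALL)


def official_text(text):
    """Compose with NFC, then drop characters that still have a
    nonzero combining class (points 1 and 3)."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in composed if not unicodedata.combining(ch))


def resolve(grain, text):
    """Return the 1-based position of the grain's string in the
    normalized text, or None if the string does not occur."""
    match = GRAIN.fullmatch(grain)
    if match is None:
        raise ValueError(f"not a character grain: {grain!r}")
    count = int(match.group(1) or 0)      # "@(...)" means "@char:0(...)"
    target = official_text(match.group(2))
    haystack = official_text(text)
    start = max(count - 1, 0)             # assumption: counts are 1-based
    found = haystack.find(target, start)  # first occurrence at or after count
    return None if found < 0 else found + 1


text = "Say Hello world! and again Hello world!"
print(resolve("@char:5(Hello world!)", text))   # 5, exact count
print(resolve("@char:3(Hello world!)", text))   # 5, undercount still works
print(resolve("@char:10(Hello world!)", text))  # 28, next occurrence
print(resolve("@(Hello world!)", text))         # 5, shortcut form
```

Because the search runs forward from the given count, a count
that is too small still lands on the intended string, as long as
that string does not occur earlier in the text.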

-Harry