FW: [osis-core] character counting issue: proposed solution

Patrick Durusau osis-core@bibletechnologieswg.org
Tue, 18 Jun 2002 16:02:24 -0400


Harry,

I will be working through your post alongside the W3C's Character Model 
for the World Wide Web (http://www.w3.org/TR/charmod/). If you have the 
time, can you look at that and see how it would fit into an OSIS 
recommendation for character counting? (I am not sure I can get to it 
today.)

Thanks!

Patrick

Harry Plantinga wrote:

>Yesterday I posted a problem with counting characters in
>Unicode: some accented characters can be encoded in different
>ways, and those encodings have different numbers of characters.
>
>Here is a proposed solution.
>
>1.  The _official_ character count is in a normalized version
>of the text, which uses minimal-length encodings of all
>characters.
>
>2.  For a given grain, e.g. @char:52(Hello world!), if the
>52nd character isn't the start of the string "Hello world!",
>point to the first occurrence of "Hello world!" after the 52nd
>character.
>
>3.  Recommend that accents and other modifiers not be counted
>when counting characters. This may underestimate the number of
>Unicode characters slightly if some accented combinations have no
>single-character representation, but in conjunction with (2)
>above it will normally give the right result, especially if the
>string is unique.
>
>4.  For people who don't like counting characters and can identify
>unique strings, allow @char:0(Hello world!).  Actually, this
>is implied by 3 above. 
>
>5.  (Extra credit). Allow @(Hello world!) as a shortcut for
>@char:0(Hello world!).
>
>-Harry
>
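
For reference, the counting problem Harry describes can be reproduced in a few lines of Python. This is only an illustration, using "e with acute accent" as the example character and the standard unicodedata module; the variable names are mine, not part of any OSIS proposal.

```python
import unicodedata

# The same visible character, encoded two ways.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT (two code points)

# A naive count disagrees between the two encodings.
print(len(composed), len(decomposed))  # 1 2

# Normalizing to a composed form (NFC) makes the counts agree.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed, len(recomposed))  # True 1
```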

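Points 1 through 4 of the proposal might be sketched as follows. This is a rough sketch under several assumptions of mine, not a proposed implementation: NFC stands in for the "minimal-length" normalized form (it is close, but not guaranteed to be minimal in every case), "accents and other modifiers" is taken to mean Unicode mark categories (Mn, Mc, Me), counts are 1-based, and the function names are hypothetical.

```python
import unicodedata

def normalize(text):
    # Assumption: NFC approximates the "minimal-length" form in point 1.
    return unicodedata.normalize("NFC", text)

def is_modifier(ch):
    # Assumption: "accents and other modifiers" = Unicode mark categories.
    return unicodedata.category(ch).startswith("M")

def count_base_characters(text):
    # Point 3: count characters, skipping accents and other modifiers.
    return sum(1 for ch in normalize(text) if not is_modifier(ch))

def offset_of_nth_base(text, n):
    # Offset of the n-th (1-based) non-modifier character in text.
    # n <= 0 means "start of text"; a too-large n yields len(text).
    if n <= 0:
        return 0
    seen = 0
    for i, ch in enumerate(text):
        if not is_modifier(ch):
            seen += 1
            if seen == n:
                return i
    return len(text)

def resolve_grain(text, n, target):
    # Point 2: if target does not start at the n-th character, fall back
    # to its first occurrence after that point. Returns an offset into
    # the normalized text, or -1 if target is not found.
    text = normalize(text)
    target = normalize(target)
    return text.find(target, offset_of_nth_base(text, n))

sample = "Cafe\u0301 ole\u0301: Hello world! Hello world!"
print(resolve_grain(sample, 11, "Hello world!"))  # 10 (exact count)
print(resolve_grain(sample, 5, "Hello world!"))   # 10 (undercount still lands)
print(resolve_grain(sample, 0, "Hello world!"))   # 10 (point 4: @char:0)
print(resolve_grain(sample, 12, "Hello world!"))  # 23 (second occurrence)
```

Note that an undercount resolves to the intended string only when no earlier occurrence lies between the counted position and the target, which is why the proposal leans on the string being unique.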
-- 
Patrick Durusau
Director of Research and Development
Society of Biblical Literature
pdurusau@emory.edu