FW: [osis-core] character counting issue: proposed solution

Patrick Durusau osis-core@bibletechnologieswg.org
Wed, 19 Jun 2002 10:04:49 -0400


Harry,

Thanks for the quick turnaround on the analysis!

Steve, since the grain proposal was yours originally, comments on 
Harry's proposal?

It would make the syntax for the regex a little more manageable (and it 
punts on all the nasty normalization issues but, as Harry notes, will 
work most of the time).

So, drop the character count and use only a string, plus an occurrence 
marker?

Question: should we still allow an alternative "grain" syntax? (here 
thinking of people who want to embed XPath/XQuery for applications that 
support that sort of thing).

Patrick

Harry Plantinga wrote:

>Patrick,
>
>I looked through the link below, and I think the issue is even 
>more difficult than I had realized. For example, ligatures can
>represent two characters with a single glyph and a single byte
>sequence. In fact, because of the various meanings of "character,"
>the w3.org web page recommends against using the term "character"
>at all, if possible.
>
>They do refer to "Unicode Normalized Form C", however, which ensures
>identical byte coding of the same set of characters. We could 
>refer to it in our definition of character counting. But that would 
>make counting characters accurately very difficult--it would 
>require reading reams of information about Unicode Normalized
>Form C just to figure out if "First" is 5 characters or 4 (because
>of an Fi ligature).
>
>However, since w3 recommends against even using the term "character", 
>we may want to consider whether that term is too fuzzy to be used 
>in a definition of a grain identifier. Besides, it's a nuisance 
>(and somewhat error-prone) to count characters even if the meaning 
>were clear. 
>
>N.B. the same sorts of problems arise in string matching; "First"
>may not match "First" if one of the strings uses a ligature.
>
>I believe that this sort of thing will be a rare problem but one
>that is very hard to solve correctly in all circumstances. The 
>only way to solve it correctly that I can think of is to insist
>that the texts and matching strings be in Unicode Normalized Form C,
>however arcane that may be.  (Then we can ignore the issue entirely
>and it will work right for us most of the time :-)
>
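
A minimal sketch of the matching problem described above, using Python's
standard unicodedata module. The helper name same_text and the sample
strings are illustrative assumptions, not part of any OSIS proposal. One
caveat it brings out: Form C on its own leaves the "fi" ligature (U+FB01)
untouched; it is the compatibility form NFKC that folds it to "f" + "i".

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same Unicode form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "caf\u00e9"      # e-acute stored as a single code point
decomposed = "cafe\u0301"   # plain 'e' followed by a combining acute accent
ligature = "\ufb01rst"      # begins with U+FB01 LATIN SMALL LIGATURE FI

print(composed == decomposed)                # False: different code points
print(same_text(composed, decomposed))       # True: NFC composes the accent
print(len(ligature), len("first"))           # 4 5
print(same_text(ligature, "first"))          # False: NFC keeps the ligature
print(same_text(ligature, "first", "NFKC"))  # True: NFKC folds it
```

So "encoded the same way" would have to name a specific normalization
form, and the choice changes both the match result and any count.
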
>So here's a revised proposal:
>
>- Drop the character count in the grain. Just use strings, with 
>  an optional parameter for the occurrence number of the string.
>
>  @(Hello world!)
>  @37(Hello world!)  (37th occurrence. Or use other syntax.)
>
>- Recognize that this will only work correctly if the strings 
>  are encoded the same way. 
>
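
The proposal above could be sketched roughly as follows. This is only an
illustration: the names parse_grain and find_grain, the choice to default
a missing occurrence number to 1, the counting of overlapping matches,
and the NFC step are all assumptions, since the thread has not settled
any of them.

```python
import re
import unicodedata
from typing import Tuple

# "@", an optional occurrence number, then the string in parentheses.
GRAIN_RE = re.compile(r"@(\d*)\((.*)\)", re.DOTALL)

def parse_grain(grain: str) -> Tuple[int, str]:
    """Split a grain such as '@37(Hello world!)' into (37, 'Hello world!')."""
    m = GRAIN_RE.fullmatch(grain)
    if m is None:
        raise ValueError(f"not a grain: {grain!r}")
    # Assumption: no number means the first occurrence.
    occurrence = int(m.group(1)) if m.group(1) else 1
    if occurrence < 1 or not m.group(2):
        raise ValueError(f"bad occurrence or empty string: {grain!r}")
    return occurrence, m.group(2)

def find_grain(text: str, grain: str) -> int:
    """Return the offset of the requested occurrence, or -1 if absent.

    The offset is measured in code points of the NFC-normalized text.
    """
    occurrence, needle = parse_grain(grain)
    # Assumption: both sides are brought to NFC so they are encoded alike.
    text = unicodedata.normalize("NFC", text)
    needle = unicodedata.normalize("NFC", needle)
    pos = -1
    for _ in range(occurrence):
        # Resume one position past the previous hit (overlaps are counted).
        pos = text.find(needle, pos + 1)
        if pos == -1:
            return -1
    return pos

sample = "Hello world! Goodbye. Hello world!"
print(find_grain(sample, "@(Hello world!)"))    # first occurrence
print(find_grain(sample, "@2(Hello world!)"))   # second occurrence
print(find_grain(sample, "@37(Hello world!)"))  # -1: no 37th occurrence
```

Note that no character counting is needed to write the grain itself; the
offset only appears as an internal result of the lookup.
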
>-Harry
>
>
>
>
>-----Original Message-----
>From: owner-osis-core@bibletechnologieswg.org
>[mailto:owner-osis-core@bibletechnologieswg.org]On Behalf Of Patrick
>Durusau
>Sent: Tuesday, June 18, 2002 4:02 PM
>To: osis-core@bibletechnologieswg.org
>Subject: Re: FW: [osis-core] character counting issue: proposed solution
>
>
>Harry,
>
>I will be trying to work through your post and the W3C's position on the 
>character model (http://www.w3.org/TR/charmod/). If you have the 
>time, can you look at that and see how it would fit into a 
>recommendation from OSIS for character counting? (Not sure I can reach 
>it today.)
>
>Thanks!
>
>Patrick
>

-- 
Patrick Durusau
Director of Research and Development
Society of Biblical Literature
pdurusau@emory.edu