FW: [osis-core] character counting issue: proposed solution

Steve DeRose osis-core@bibletechnologieswg.org
Wed, 19 Jun 2002 21:32:21 -0400


At 10:04 AM -0400 06/19/02, Patrick Durusau wrote:
>Harry,
>
>Thanks for the quick turnaround on the analysis!
>
>Steve, since the grain proposal was yours originally, comments on 
>Harry's proposal?
>
>Would make the syntax for the regex a little more manageable. (and 
>punts on all the nasty normalization issues, but as Harry notes, 
>will work most of the time.)
>
>So, drop use of character and use only string? plus occurrence marker?

I thought of the string as merely a check; we could of course think 
of the string as normative and include the number as a check; or not 
include it (though if we don't have both, it's harder to catch errors 
automatically); or make both optional so you can go either way.

If we did the string + occurrence number thing, it would be uglier to 
get point selections (nth occurrence of null string, maybe?), or 
short targets like single characters or common words. It assumes the 
string length is the same as the target length, which we didn't have 
to assume before when we had a length (though I think we should have 
at least made the *default* length match the string, if we hadn't 
said that already). I think it still has the normalization problem 
just to match the strings (though maybe not quite as bad; haven't 
thought it all the way through yet).
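
A minimal sketch of the string + occurrence-number lookup discussed 
above, to make the two concerns concrete: point selection via the 
empty string, and string matching that depends on normalization. The 
function name nth_occurrence and its parameters are illustrative 
assumptions, not part of any OSIS proposal. Note that Normalization 
Form C does not fold the "fi" ligature (U+FB01); only the 
compatibility forms (NFKC/NFKD) do.

```python
import unicodedata


def nth_occurrence(text, target, n=1, form=None):
    """Return the 0-based code point index of the nth (1-based)
    occurrence of target in text, or -1 if there is no such occurrence.

    form is an optional Unicode normalization form name ("NFC",
    "NFKC", ...). When given, both strings are normalized first, so the
    returned index refers to the *normalized* text, not the original.
    Overlapping occurrences are counted.
    """
    if n < 1:
        raise ValueError("occurrence numbers start at 1")
    if form is not None:
        text = unicodedata.normalize(form, text)
        target = unicodedata.normalize(form, target)
    pos = -1
    for _ in range(n):
        # Search again starting one code point past the previous hit.
        pos = text.find(target, pos + 1)
        if pos == -1:
            return -1
    return pos


ligature_text = "\ufb01rst things first"  # begins with the fi ligature

# Without normalization, only the plain "first" at the end matches.
print(nth_occurrence(ligature_text, "first"))               # 12
# NFC leaves the ligature alone, so the result is unchanged.
print(nth_occurrence(ligature_text, "first", form="NFC"))   # 12
# NFKC folds the ligature to "fi", so the first word now matches.
print(nth_occurrence(ligature_text, "first", form="NFKC"))  # 0

# Precomposed vs. combining accent: NFC is enough to make these match.
print(nth_occurrence("caf\u00e9", "cafe\u0301"))              # -1
print(nth_occurrence("caf\u00e9", "cafe\u0301", form="NFC"))  # 0

# "nth occurrence of the null string" acts as a point selection:
# the 3rd empty match in "abc" sits just before the 3rd code point.
print(nth_occurrence("abc", "", 3))                         # 2
```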

I'm inclined to suggest:

a) change 'character' to 'code point' and explain that it's dumb.

b) adopt Harry's method of looking forward upon finding mismatch.

c) make the +length optional, and default it to the string length

d) state that length 0 is a point selection before the nth char

e) state that offsets start at 1 and can't be negative to count backwards.

f) state what happens if the offset or length goes beyond the content 
of the referenced element we're counting in. Just copy the XPointer 
rules on this, I suppose (now, if I could only remember what they 
are...).

g) perhaps? make offset optional in which case you get the string. eh.

Does that cut a plausible compromise on well-defined counting vs. 
ease of implementation? Any boundary cases left unspecified?
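
One way points (a) through (g) might fit together, as a hedged sketch 
rather than a definitive implementation. Everything named here 
(resolve_grain, its parameters, the exceptions raised) is 
hypothetical. Two assumptions in particular: "looking forward upon 
finding mismatch" is taken to mean searching forward from the given 
offset for the check string, and out-of-range offsets or lengths are 
simply rejected, since the rule for (f) is left open above. Python 
strings index by code point, which matches (a).

```python
def resolve_grain(content, offset=None, check="", length=None):
    """Resolve a grain to a (start, end) pair of 0-based code point
    indices into content, with end exclusive.

    offset  1-based code point offset (e); None means "find the string" (g)
    check   string expected at that offset; may be empty
    length  selection length; defaults to len(check) (c); 0 is a point
            selection just before the code point at offset (d)
    """
    if offset is not None and offset < 1:
        raise ValueError("offsets start at 1 and cannot be negative")
    if length is None:
        length = len(check)
    if length < 0:
        raise ValueError("length cannot be negative")

    start = 0 if offset is None else offset - 1
    if start > len(content):
        # Assumption for (f): reject rather than clamp.
        raise IndexError("offset is beyond the end of the content")

    if check and not content.startswith(check, start):
        # Assumption for (b): on mismatch, look forward for the string.
        found = content.find(check, start)
        if found == -1:
            raise LookupError("check string not found at or after offset")
        start = found

    end = start + length
    if end > len(content):
        # Assumption for (f): reject rather than truncate.
        raise IndexError("selection runs past the end of the content")
    return start, end


verse = "In the beginning God created"

print(resolve_grain(verse, 18, "God"))       # (17, 20): exact match
print(resolve_grain(verse, 15, "God"))       # (17, 20): found by looking forward
print(resolve_grain(verse, 4, length=0))     # (3, 3): point before the 4th code point
print(resolve_grain(verse, check="beginning"))  # (7, 16): no offset given
```

Carrying both the offset and the string keeps the automatic error 
check described earlier: a wrong offset is recoverable by the forward 
search, and a string that never appears is reported instead of 
silently selecting the wrong span.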


>
>Question: should we still allow an alternative "grain" syntax? (here 
>thinking of people who want to embed XPath/XQuery for applications 
>that support that sort of thing).
>
>Patrick
>
>Harry Plantinga wrote:
>
>>Patrick,
>>
>>I looked through the link below, and I think the issue is even more 
>>difficult than I had realized. For example, ligatures can
>>represent two characters with a single glyph and a single byte
>>sequence. In fact, because of the various meanings of "character,"
>>the w3.org web page recommends against using the term "character"
>>at all, if possible.
>>
>>They do refer to "Unicode Normalized Form C", however, which ensures
>>identical byte coding of the same set of characters. We could refer 
>>to it in our definition of character counting. But that would make 
>>counting characters accurately very difficult--it would require 
>>reading reams of information about Unicode Normalized
>>Form C just to figure out if "First" is 5 characters or 4 (because
>>of an Fi ligature).
>>
>>However, since w3 recommends against even using the term 
>>"character", we may want to consider whether that term is too fuzzy 
>>to be used in a definition of a grain identifier. Besides, it's a 
>>nuisance (and somewhat error-prone) to count characters even if the 
>>meaning were clear.
>>N.B. the same sorts of problems arise in string matching; "First"
>>may not match "First" if one of the strings uses a ligature.
>>
>>I believe that this sort of thing will be a rare problem but one
>>that is very hard to solve correctly in all circumstances. The only 
>>way to solve it correctly that I can think of is to insist
>>that the texts and matching strings be in Unicode Normalized Form C,
>>however arcane that may be.  (Then we can ignore the issue entirely
>>and it will work right for us most of the time :-)
>>
>>So here's a revised proposal:
>>
>>- Drop the character count in the grain. Just use strings, with an 
>>optional parameter for the occurrence number of the string.
>>
>>  @(Hello world!)
>>  @37(Hello world!)  (37th occurrence. Or use other syntax.)
>>
>>- Recognize that this will only work correctly if the strings are 
>>encoded the same way.
>>-Harry
>>
>>
>>
>>
>>-----Original Message-----
>>From: owner-osis-core@bibletechnologieswg.org
>>[mailto:owner-osis-core@bibletechnologieswg.org]On Behalf Of Patrick
>>Durusau
>>Sent: Tuesday, June 18, 2002 4:02 PM
>>To: osis-core@bibletechnologieswg.org
>>Subject: Re: FW: [osis-core] character counting issue: proposed solution
>>
>>
>>Harry,
>>
>>I will be trying to work through your post and the W3C's position 
>>on the character set model (http://www.w3.org/TR/charmod/). If you 
>>have the time, can you look at that and see how it would fit into a 
>>recommendation from OSIS for character counting? (Not sure I can 
>>reach it today.)
>>
>>Thanks!
>>
>>Patrick
>>
>
>--
>Patrick Durusau
>Director of Research and Development
>Society of Biblical Literature
>pdurusau@emory.edu


-- 

Steve DeRose -- http://www.stg.brown.edu/~sjd
Chair, Bible Technologies Group -- http://www.bibletechnologies.net
Email: sderose@speakeasy.net
Backup email: sderose@mac.com, sjd@stg.brown.edu