FW: [osis-core] character counting issue: proposed solution

Patrick Durusau osis-core@bibletechnologieswg.org
Fri, 21 Jun 2002 06:27:08 -0400


Steve,

Can you and Harry (and anyone else who has comments on this issue) 
derive a consensus on the solution?

Suggestions for syntax would be welcome (particularly if simpler than 
the morass that I derived for the schema).

Should I hold off a new release today in hopes of issuing one tomorrow 
that should be perused for bug fixes? (That assumes we reach some 
agreement on this syntax issue and the additional header element (for 
the TEI header) today.)

Anyone up for discussions of the reference and pointing syntax?

Patrick

Steve DeRose wrote:

> At 10:04 AM -0400 06/19/02, Patrick Durusau wrote:
>
>> Harry,
>>
>> Thanks for the quick turn around on the analysis!
>>
>> Steve, since the grain proposal was yours originally, comments on 
>> Harry's proposal?
>>
>> Would make the syntax for the regex a little more manageable. (and 
>> punts on all the nasty normalization issues, but as Harry notes, will 
>> work most of the time.)
>>
>> So, drop use of character and use only string? plus occurrence marker?
>
>
> I thought of the string as merely a check; we could of course think of 
> the string as normative and include the number as a check; or not 
> include it (though if we don't have both, it's harder to catch errors 
> automatically); or make both optional so you can go either way.
>
> If we did the string + occurrence number thing, it would be uglier to 
> get point selections (nth occurrence of null string, maybe?), or short 
> targets like single characters or common words. It assumes the string 
> length is the same as the target length, which we didn't have to 
> assume before when we had a length (though I think we should have at 
> least made the *default* length match the string, if we hadn't said 
> that already). I think it still has the normalization problem just to 
> match the strings (though maybe not quite as bad; haven't thought it 
> all the way through yet).
>
> I'm inclined to suggest:
>
> a) change 'character' to 'code point' and explain that it's dumb.
>
> b) adopt Harry's method of looking forward upon finding mismatch.
>
> c) make the +length optional, and default it to the string length
>
> d) state that length 0 is a point selection before the nth char
>
> e) state that offsets start at 1 and can't be negative to count 
> backwards.
>
> f) state what happens if the offset or length goes beyond the content 
> of the referenced element we're counting in. Just copy the xpointer 
> rules on this, I suppose (now, if I could only remember what they 
> are...).
>
> g) perhaps? make offset optional in which case you get the string. eh.
>
> Does that cut a plausible compromise on well-defined counting vs. ease 
> of implementation? Any boundary cases left unspecified?
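One way to read points (a) through (g) above is sketched below in Python. This is only an illustration under stated assumptions, not the OSIS syntax or an agreed algorithm: the function name `select`, its parameters, and the `(start, end)` return value are hypothetical. "Looking forward upon finding mismatch" is taken to mean a forward search for the check string, and running past the end of the content simply returns `None`, since the thread leaves that case to be settled later.

```python
def select(content, offset=1, check="", length=None):
    """Resolve a grain to a 0-based, end-exclusive (start, end) span.

    Hypothetical helper: names and return shape are illustrative only.
    Python str indexes by code point, which matches rule (a).
    """
    # Rule (e): offsets start at 1 and cannot be negative.
    if offset < 1:
        raise ValueError("offset must be 1 or greater")
    # Rule (c): length is optional and defaults to the check string's length.
    if length is None:
        length = len(check)
    if length < 0:
        raise ValueError("length cannot be negative")

    start = offset - 1
    # Rule (b): on a mismatch, look forward for the check string
    # (assumed interpretation of the look-forward method).
    if check and content[start:start + len(check)] != check:
        start = content.find(check, start)
        if start == -1:
            return None
    # Rule (f): out-of-range behaviour is undecided in the thread;
    # returning None here is an assumption, not a stated rule.
    if start + length > len(content):
        return None
    # Rule (d): length 0 yields a point selection before the nth code point.
    return (start, start + length)


verse = "In the beginning God created"
print(select(verse, 8, "beginning"))   # exact match at the stated offset
print(select(verse, 5, "beginning"))   # mismatch, found by looking forward
print(select(verse, 8, "", length=0))  # point selection
print(select(verse, check="God"))      # rule (g): offset omitted
```

With the offset defaulting to 1, the forward search also covers point (g): omitting the offset selects the first occurrence of the string.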
>
>
>>
>> Question: should we still allow an alternative "grain" syntax? (here 
>> thinking of people who want to embed XPath/XQuery for applications 
>> that support that sort of thing).
>>
>> Patrick
>>
>> Harry Plantinga wrote:
>>
>>> Patrick,
>>>
>>> I looked through the link below, and I think the issue is even more 
>>> difficult than I had realized. For example, ligatures can
>>> represent two characters with a single glyph and a single byte
>>> sequence. In fact, because of the various meanings of "character,"
>>> the w3.org web page recommends against using the term "character"
>>> at all, if possible.
>>>
>>> They do refer to "Unicode Normalized Form C", however, which ensures
>>> identical byte coding of the same set of characters. We could refer 
>>> to it in our definition of character counting. But that would make 
>>> counting characters accurately very difficult--it would require 
>>> reading reams of information about Unicode Normalized
>>> Form C just to figure out if "First" is 5 characters or 4 (because
>>> of an Fi ligature).
>>>
>>> However, since w3 recommends against even using the term 
>>> "character", we may want to consider whether that term is too fuzzy 
>>> to be used in a definition of a grain identifier. Besides, it's a 
>>> nuisance (and somewhat error-prone) to count characters even if the 
>>> meaning were clear.
>>> N.B. the same sorts of problems arise in string matching; "First"
>>> may not match "First" if one of the strings uses a ligature.
>>>
>>> I believe that this sort of thing will be a rare problem but one
>>> that is very hard to solve correctly in all circumstances. The only 
>>> way to solve it correctly that I can think of is to insist
>>> that the texts and matching strings be in Unicode Normalized Form C,
>>> however arcane that may be.  (Then we can ignore the issue entirely
>>> and it will work right for us most of the time :-)
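The counting and matching problem described above can be checked with Python's standard `unicodedata` module. This is a small sketch, not part of any OSIS tooling. It uses lowercase "first" because the precomposed ligature in Unicode is U+FB01 LATIN SMALL LIGATURE FI. One caveat worth noting: Normalization Form C composes base characters with combining marks but leaves compatibility characters such as this ligature untouched; it is the compatibility form NFKC that folds the ligature into "fi".

```python
import unicodedata

plain = "first"        # five code points
ligated = "\ufb01rst"  # U+FB01 (fi ligature) + "rst": four code points

# The two strings look alike when rendered but differ in count and content.
print(len(plain), len(ligated))
print(plain == ligated)

# NFC does not touch the ligature, so the mismatch survives normalization.
nfc = unicodedata.normalize("NFC", ligated)
print(nfc == ligated, nfc == plain)

# NFKC applies compatibility mappings and expands the ligature to "fi".
nfkc = unicodedata.normalize("NFKC", ligated)
print(nfkc == plain, len(nfkc))

# What NFC does handle: "e" + COMBINING ACUTE ACCENT becomes one code point.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))
```

So requiring a single normalization form makes accented text count and compare consistently, while ligatures would need either NFKC or an explicit rule of their own.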
>>>
>>> So here's a revised proposal:
>>>
>>> - Drop the character count in the grain. Just use strings, with  an 
>>> optional parameter for the occurrence number of the string.
>>>
>>>  @(Hello world!)
>>>  @37(Hello world!)  (37th occurrence. Or use other syntax.)
>>>
>>> - Recognize that this will only work correctly if the strings  are 
>>> encoded the same way.
>>> -Harry
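A minimal sketch of how the string-plus-occurrence form proposed above might be resolved, written in Python. The `@N(string)` shape is taken from the two examples in the proposal; everything else is an assumption made for illustration: the function name `resolve_grain`, the regular expression, counting occurrences without overlap, and treating a missing occurrence as `None`. Strings that themselves contain a closing parenthesis and the empty string are not handled, since the proposal does not say how they would be written.

```python
import re

# Hypothetical pattern for "@(string)" and "@N(string)".
GRAIN = re.compile(r"^@(\d*)\((.+)\)$")


def resolve_grain(text, grain):
    """Return the 0-based, end-exclusive span of the Nth occurrence, or None."""
    m = GRAIN.match(grain)
    if m is None:
        raise ValueError("not a grain of the form @N(string)")
    # The occurrence number is optional and assumed to default to 1.
    n = int(m.group(1)) if m.group(1) else 1
    if n < 1:
        raise ValueError("occurrence numbers start at 1")
    target = m.group(2)

    pos = -1
    search_from = 0
    for _ in range(n):
        pos = text.find(target, search_from)
        if pos == -1:
            return None  # fewer than n occurrences in the text
        # Assumption: occurrences are counted without overlap.
        search_from = pos + len(target)
    return (pos, pos + len(target))


sample = "Hello world! Hello world!"
print(resolve_grain(sample, "@(Hello world!)"))
print(resolve_grain(sample, "@2(Hello world!)"))
print(resolve_grain(sample, "@3(Hello world!)"))
```

The comparison in `str.find` is code point by code point, so it only succeeds when the text and the grain string are encoded the same way, which is the limitation the proposal itself points out.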
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: owner-osis-core@bibletechnologieswg.org
>>> [mailto:owner-osis-core@bibletechnologieswg.org]On Behalf Of Patrick
>>> Durusau
>>> Sent: Tuesday, June 18, 2002 4:02 PM
>>> To: osis-core@bibletechnologieswg.org
>>> Subject: Re: FW: [osis-core] character counting issue: proposed 
>>> solution
>>>
>>>
>>> Harry,
>>>
>>> I will be trying to work through your post and the W3C's position on 
>>> the character set model (http://www.w3.org/TR/charmod/). If you have 
>>> the time, can you look at that and see how it would fit into a 
>>> recommendation from OSIS for character counting? (Not sure I can 
>>> reach it today.)
>>>
>>> Thanks!
>>>
>>> Patrick
>>>
>>
>> -- 
>> Patrick Durusau
>> Director of Research and Development
>> Society of Biblical Literature
>> pdurusau@emory.edu
>
>
>

-- 
Patrick Durusau
Director of Research and Development
Society of Biblical Literature
pdurusau@emory.edu