xml:whitespace: was Re: [osis-core] <hi> types

Patrick Durusau osis-core@bibletechnologieswg.org
Thu, 21 Aug 2003 06:39:11 -0400


Gentlemen, ;-)

Before this heats up much more, let's make sure we are on the same page.

Troy A. Griffitts wrote:
> Todd,
>     How would YOU suggest we force people to markup 2 spaces between 
> sentences?
>     2 spaces between STATE and ZIP in an address?
>     Extra spaces before GOD in Chinese?
>     Preserve TABs?
>     Preserve NewLines?
> 
>     How would YOU suggest we allow large amounts of data, like I have 
> suggested WON'T make it into OSIS and Harry seems to think the same, if 
> we FORCE the people marking up text to add all these in by hand? BETWEEN 
> EVERY SENTENCE (Whatever you propose, as we don't even have a &nbsp; 
> right now).
> 

In XML, the relevant attribute is xml:space which can have two values, 
default or preserve.

Note that an XML parser actually passes all characters that are not 
markup (including whitespace) to the application.

Or in the words of the XML 1.0 (2nd edition) spec:

> A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications. In valid documents, this attribute, like any other, must be declared if it is used. When declared, it must be given as an enumerated type whose values are one or both of "default" and "preserve".

In other words, the XML parser does not "do" anything to the whitespace, 
but merely passes it along and gives the application notice of it, along 
with how it "should" be processed by the application.
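
A minimal sketch of that point, assuming Python and its standard 
xml.etree.ElementTree parser (neither is mentioned in this thread, and 
the sample address is made up). It only shows that the whitespace and 
the xml:space value both arrive at the application unchanged:

```python
import xml.etree.ElementTree as ET

# The reserved "xml" prefix is bound to this namespace; ElementTree
# reports the attribute under its expanded name.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

# Made-up sample: two spaces before the ZIP, then a newline and a tab.
SAMPLE = '<p xml:space="preserve">Springfield, IL  62704\n\tUSA</p>'

elem = ET.fromstring(SAMPLE)

# The parser hands over every whitespace character untouched...
text = elem.text
# ...and reports xml:space as an ordinary attribute value.
hint = elem.get(XML_SPACE)

print(repr(text))
print(hint)
```

What happens to the spaces, tab and newline after this point is 
entirely up to the code that receives them.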

There is no guarantee that the application will honor this intention. 
Note that browsers are a good example of applications that have default 
rules for handling whitespace.

Now, if you are using XSLT to transform the XML that has been passed 
along by the XML parser, which as noted, includes all the whitespace, 
there are two top level elements (both occur under <xsl:stylesheet>), 
<xsl:preserve-space> and <xsl:strip-space>.

Both operate as their names suggest on whitespace-only text nodes, but 
what whitespace reaches the output is still decided by the stylesheet, 
not by the xml:space signal alone.

In other words, whether putting the attribute xml:space="preserve" will 
have any impact on the processing of the whitespace in the content of 
that element depends upon the stylesheet (if you are using XSLT) or the 
application itself.

So, it is not simply an issue of putting the xml:space="preserve" 
attribute at the top of the XML document and rolling along. The 
resulting display will vary according to the stylesheet/application that 
is used with the text.
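
To illustrate how much rests with the application, here is a sketch of 
one possible policy, again assuming Python. The function name render 
and the collapse-by-default rule are hypothetical choices, not anything 
required by XML or by OSIS. This application happens to honor 
xml:space; another could ignore it completely:

```python
import re
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def render(elem, preserve=False):
    """Return the text content of elem under one hypothetical policy:
    collapse runs of whitespace unless xml:space="preserve" is in effect."""
    # The attribute applies to the element and its descendants until a
    # closer element overrides it.
    setting = elem.get(XML_SPACE)
    if setting is not None:
        preserve = (setting == "preserve")

    def fix(s):
        if s is None:
            return ""
        return s if preserve else re.sub(r"\s+", " ", s)

    parts = [fix(elem.text)]
    for child in elem:
        parts.append(render(child, preserve))
        # Text after a child still belongs to this element.
        parts.append(fix(child.tail))
    return "".join(parts)


doc = ET.fromstring(
    '<div><p>one  two</p><p xml:space="preserve">one  two</p></div>'
)
print(repr(render(doc[0])))
print(repr(render(doc[1])))
```

The same two spaces survive in the second paragraph and vanish in the 
first, purely because of how this one function was written.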


>     Wouldn't it be nice to take a LARGE volume of texts that aren't 
> worth spending the time to markup in detail, tack the 
> xml:whitespace="preserve" tag to the top, break it up into general 
> sections with osisID attributes and be done with it?
> 

Agree that we need a LARGE volume of texts in OSIS but am not at all 
certain that whitespace issues will have that great an impact one way or 
the other.

From above:

 > How would YOU suggest we force people to markup 2 spaces between
 > sentences?

Why would I need "2 spaces between sentences?"

 >     2 spaces between STATE and ZIP in an address?

Or here?

 >     Extra spaces before GOD in Chinese?

I assume this is a rendering requirement? I suggest 
<divineName>God</divineName>, marking the occurrences with a script so 
that the style can be imposed at rendering time.
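
A rough sketch of such a script, assuming Python and assuming the input 
is plain text with no existing markup (it would mark an already tagged 
occurrence a second time). The function name mark_divine_name is made 
up for illustration:

```python
import re

# Whole-word match, so words such as "Godly" are left alone.
GOD = re.compile(r"\bGod\b")


def mark_divine_name(text):
    """Wrap each standalone occurrence of "God" in <divineName>."""
    return GOD.sub(r"<divineName>\g<0></divineName>", text)


print(mark_divine_name("In the beginning God created the heaven."))
```

Any extra spacing before the name then becomes a job for the 
stylesheet or application that renders <divineName>.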

 >     Preserve TABs?

Do you mean as in tables? That's an ugly problem. Seems like I saw a 
partial solution to that years ago, let me poke around in my SGML 
archives for a while.

 >     Preserve NewLines?

Not sure what you mean here?


Note that I don't think it is required that people mark up every feature 
that we might want to have in an OSIS document. So long as they are 
consistent in their practices, I suspect a lot of markup can be inserted 
using scripts. After all, in the early days of markup there were no 
markup editors and any serious amount of text was converted using 
scripts. That has sort of fallen by the wayside, or at least is not 
discussed as much.
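
As a sketch of the kind of script meant here, assuming Python and 
assuming a source that consistently separates paragraphs with blank 
lines (both are assumptions, as is the function name 
paragraphs_to_markup). The <p> element is used only for illustration:

```python
import re
from xml.sax.saxutils import escape


def paragraphs_to_markup(plain):
    """Turn blank-line-separated plain text into a series of <p> elements."""
    # Blank lines separate paragraphs in the assumed source convention.
    blocks = re.split(r"\n\s*\n", plain.strip())
    out = []
    for block in blocks:
        # Collapse line breaks and runs of spaces inside a paragraph.
        body = " ".join(block.split())
        # Escape &, < and > so the result stays well-formed.
        out.append("<p>" + escape(body) + "</p>")
    return "\n".join(out)


print(paragraphs_to_markup("First line\nstill first.\n\nSecond & last."))
```

A script like this only works because the source is consistent; an 
inconsistent source would need cleanup first.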

Could even create a category of "pre-OSIS" texts: texts that have some 
markup but would be really useful if they had a bit more. If you had a 
lot of text from a particular source, it is likely that someone would 
be interested in writing scripts to run across the entire collection.

Don't think we should dismiss your idea of large amounts of lightly 
marked texts nor ignore Chris's suggestion that some cleanup is probably 
not that hard. It really isn't an either/or situation.

Suggestion: Can we find a place for migration to OSIS on next week's 
calendar? Perhaps an evening session? (No, I will not stay up to 
midnight like some people talked me into in Rome several years ago but 
can last until about 8:30-9:00 PM.) Thought Troy could give us examples 
of large amounts of lightly encoded texts and Chris could suggest regexes 
that would make the encoding more robust.

Hope everyone is at the start of a great day!

Patrick


>     -Troy.
> 
> 
> 
> Todd Tillinghast wrote:
> 
>> Troy,
>>
>> I think <hi> and xml:whitespace fall into two different categories.  I
>> think the discussion to date points away from the need for
>> xml:whitespace.
>>
>> Todd
>>
>>
>>> -----Original Message-----
>>> From: osis-core-admin@bibletechnologieswg.org [mailto:osis-core-
>>> admin@bibletechnologieswg.org] On Behalf Of Troy A. Griffitts
>>> Sent: Wednesday, August 20, 2003 3:32 PM
>>> To: osis-core@bibletechnologieswg.org
>>> Subject: Re: [osis-core] <hi> types
>>>
>>> So does that mean we intend to honor the xml:whitespace="preserve"
>>> attributed suggested by W3C?
>>>
>>> Patrick Durusau wrote:
>>>
>>>> Harry,
>>>>
>>>> Harry Plantinga wrote:
>>>>
>>>>
>>>>>> I am concerned that encoders using would use the presentation related
>>>>>> elements RATHER THAN other elements.  (Ex <hi
>>>>>> type='smallCaps'>Lord</hi> rather than <divineName
>>>>>> type='yhwh'>Lord</divineName>, etc...)
>>>>>>
>>>>>> I do see a need for <hi> in non-Biblical texts.  If as Chris suggests
>>>>>> we use <hi> to encode meaning and not presentation we will be better
>>>>>> off. I would like to say away from type values of bold, italics,
>>>>>> etc... in favor of strongEmphasis, emphasis, etc...  I don't have a
>>>>>> good suggestions for a comprehensive set of a type values.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I've seen this debate many times before and usually it is not
>>>>> settled to everyone's satisfaction. However, it is clear that
>>>>> there are times when italics, bold, etc. will be present in a text and
>>>>> will not be representable in any OSIS markup apart
>>>>> from something like <hi type="bold">.
>>>>>
>>>>
>>>> Say its not so, Harry! ;-)
>>>>
>>>>
>>>>> It is also clear to me that 95% of the time encoders are going
>>>>> to be unwilling to go through an old book and figure out
>>>>> what each instance of italicized text means when there is
>>>>> <hi type="italics"> available that meets 95% of people's usage
>>>>> needs.
>>>>>
>>>>> That is, everyone has a threshhold at which they say "I just
>>>>> mean italics, darnit!" but if italics is an available markup
>>>>> option, it'll be used much more than some will find desirable.
>>>>>
>>>>> But if there is no way of marking some text as 'italics', OSIS will
>>>>> not be usable for quick-and-dirty conversion of
>>>>> texts from one markup to another -- only for very laborious,
>>>>> hand-tuned markup. If that's what you want, go for it!
>>>>>
>>>>
>>>> I think Harry has the right of it, reluctantly, but I do. Getting large
>>>> amounts of texts into some semblance of reasonable markup is difficult
>>>> enough without insisting on practices that most encoders either aren't
>>>> capable of following or won't. At best the material is unmarked
>>>> altogether, at worse they don't use the markup system at all.
>>>>
>>>> I would go with Chris's suggestion of common names, such as italic,
>>>> bold, etc., (yea, verily, presentation language) rather than less
>>>> intuitive alternatives.
>>>>
>>>> Actually we could begin to build NLP software with knowledge bases of
>>>> terms, names, etc., that would allow some automated upgrading of less
>>>> complex encoding.
>>>>
>>>> Hope everyone is having a great day!
>>>>
>>>> Patrick
>>>>
>>>>
>>>>> -Harry
>>>>>
>>>>> _______________________________________________
>>>>> osis-core mailing list
>>>>> osis-core@bibletechnologieswg.org
>>>>> http://www.bibletechnologieswg.org/mailman/listinfo/osis-core
>>>>>
>>>>
>>>>
>>
>>
>>
> 
> 
> 


-- 
Patrick Durusau
Director of Research and Development
Society of Biblical Literature
Patrick.Durusau@sbl-site.org
Chair, V1 - Text Processing: Office and Publishing Systems Interface
Co-Editor, ISO 13250, Topic Maps -- Reference Model

Topic Maps: Human, not artificial, intelligence at work!