whitespace: was Re: [osis-core] <hi> types

Todd Tillinghast osis-core@bibletechnologieswg.org
Thu, 21 Aug 2003 11:04:32 -0600


I think this is an important issue on which we need to reach consensus
before we can really work on other best-practice issues.  Why don't we
address it in the section of the agenda allocated to issues related to
non-Biblical text in general?

Todd

> -----Original Message-----
> From: osis-core-admin@bibletechnologieswg.org
> [mailto:osis-core-admin@bibletechnologieswg.org] On Behalf Of Patrick Durusau
> Sent: Thursday, August 21, 2003 4:39 AM
> To: osis-core@bibletechnologieswg.org
> Subject: xml:whitespace: was Re: [osis-core] <hi> types
> 
> Gentlemen, ;-)
> 
> Before this heats up much more, let's make sure we are on the same
> page.
> 
> Troy A. Griffitts wrote:
> > Todd,
> >     How would YOU suggest we force people to markup 2 spaces between
> > sentences?
> >     2 spaces between STATE and ZIP in an address?
> >     Extra spaces before GOD in Chinese?
> >     Preserve TABs?
> >     Preserve NewLines?
> >
> >     How would YOU suggest we allow for large amounts of data (which I
> > have suggested WON'T make it into OSIS, and Harry seems to think the
> > same) if we FORCE the people marking up text to add all these in by
> > hand? BETWEEN EVERY SENTENCE (whatever you propose, as we don't even
> > have a &nbsp; right now).
> >
> 
> In XML, the relevant attribute is xml:space which can have two values,
> default or preserve.
> 
> Note that an XML parser actually passes all characters (including
> whitespace) to the application that are not markup.
> 
> Or in the words of the XML 1.0 (2nd edition) spec:
> 
> > A special attribute named xml:space may be attached to an element to
> > signal an intention that in that element, white space should be
> > preserved by applications. In valid documents, this attribute, like
> > any other, must be declared if it is used. When declared, it must be
> > given as an enumerated type whose values are one or both of "default"
> > and "preserve".
> 
> In other words, the XML parser does not "do" anything to the
> whitespace, but merely passes it along and gives the application
> notice of it, along with how it "should" be processed by the
> application.
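
A minimal sketch of the point above, assuming Python's standard xml.etree.ElementTree as the parser; the sample address and the element names div and p are invented for illustration and are not OSIS markup. Both paragraphs reach the application with their two spaces intact, and xml:space shows up only as an attribute the application may consult.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this namespace by the XML specification.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

# Hypothetical sample: the same text with and without xml:space.
doc = ET.fromstring(
    '<div>'
    '<p xml:space="preserve">Springfield, IL  62704</p>'
    '<p>Springfield, IL  62704</p>'
    '</div>'
)

for p in doc:
    # The parser hands over the character data unchanged in both cases;
    # the attribute is only a hint for whatever processes the text next.
    print(repr(p.text), p.get(XML_SPACE))
```

Whether the second paragraph's double space survives to the display is decided later, by the stylesheet or the application.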
> 
> There is no guarantee that the application will honor this intention.
> Note that browsers are a good example of applications that have
> default rules for handling whitespace.
> 
> Now, if you are using XSLT to transform the XML that has been passed
> along by the XML parser, which as noted, includes all the whitespace,
> there are two top-level elements (both occur under <xsl:stylesheet>),
> <xsl:preserve-space> and <xsl:strip-space>.
> 
> These operate as their names suggest, but only on whitespace-only text
> nodes. XSLT 1.0 does keep such nodes where xml:space="preserve" is in
> scope, but nothing obliges whatever renders the result to keep them.
> 
> In other words, whether putting the attribute xml:space="preserve"
> will have any impact on the processing of the whitespace in the
> content of that element depends upon the stylesheet (if you are using
> XSLT) or the application itself.
> 
> So, it is not simply an issue of putting the xml:space="preserve"
> attribute at the top of the XML document and rolling along. The
> resulting display will vary according to the stylesheet/application
> that is used with the text.
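
To illustrate how the outcome depends on the application, here is a sketch of one hypothetical application rule, again using Python's xml.etree.ElementTree: collapse runs of whitespace to a single space unless xml:space="preserve" is in scope. The function name collapse and the sample markup are assumptions for illustration; an application with a different rule would render the same document differently.

```python
import re
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def collapse(elem, preserve=False):
    """Collapse whitespace runs in place unless xml:space="preserve" is in scope."""
    # The nearest xml:space declaration wins; otherwise inherit from the parent.
    setting = elem.get(XML_SPACE)
    if setting is not None:
        preserve = (setting == "preserve")
    if not preserve and elem.text:
        elem.text = re.sub(r"\s+", " ", elem.text)
    for child in elem:
        collapse(child, preserve)
        # Text after a child belongs to this element's content,
        # so it follows this element's setting, not the child's.
        if not preserve and child.tail:
            child.tail = re.sub(r"\s+", " ", child.tail)

# Hypothetical sample: only the first paragraph asks for preservation.
doc = ET.fromstring('<div><p xml:space="preserve">a  b</p><p>a  b</p></div>')
collapse(doc)
print([p.text for p in doc])
```

Only the paragraph carrying the attribute keeps its double space, and only because this particular application chose to honor it.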
> 
> 
> >     Wouldn't it be nice to take a LARGE volume of texts that aren't
> > worth spending the time to markup in detail, tack the
> > xml:whitespace="preserve" tag to the top, break it up into general
> > sections with osisID attributes and be done with it?
> >
> 
> Agree that we need a LARGE volume of texts in OSIS but am not at all
> certain that whitespace issues will have that great an impact one way
> or the other.
> 
> From above:
> 
> > How would YOU suggest we force people to markup 2 spaces between
> > sentences?
> 
> Why would I need "2 spaces between sentences?"
> 
>  >     2 spaces between STATE and ZIP in an address?
> 
> Or here?
> 
>  >     Extra spaces before GOD in Chinese?
> 
> Assume this is a rendering requirement? Suggest
> <divineName>God</divineName>, assuming you can mark occurrences with a
> script for imposition of the style.
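
A script of the kind suggested above might look like the following sketch in Python. It assumes plain, not-yet-marked-up text and a hypothetical list of names to tag; the function name mark_divine_name is invented for illustration. Running it over text that already contains <divineName> would double-tag, so a real script would need to guard against that.

```python
import re

def mark_divine_name(text, names=("God",)):
    """Wrap whole-word occurrences of each name in <divineName>."""
    # \b keeps words such as "Godly" from matching; names are escaped literally.
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(r"<divineName>\1</divineName>", text)

print(mark_divine_name("In the beginning God created the heaven and the earth."))
```

Any extra spacing or other rendering before the name can then be imposed by a stylesheet keyed on the element, rather than typed into the text.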
> 
>  >     Preserve TABs?
> 
> Do you mean as in tables? That's an ugly problem. Seems like I saw a
> partial solution to that years ago, let me poke around in my SGML
> archives for a while.
> 
>  >     Preserve NewLines?
> 
> Not sure what you mean here?
> 
> 
> Note that I don't think it is required that people markup every
> feature that we might want to have in an OSIS document. So long as
> they are consistent in their practices, I suspect a lot of markup can
> be inserted using scripts. After all, in the early days of markup
> there were no markup editors and any serious amount of text was
> converted using scripts. That has sort of fallen by the wayside, or at
> least is not discussed as much.
> 
> Could even create a category of "pre-OSIS" texts: texts that have
> some markup but would really be useful if they had a bit more. If you
> had a lot of text from a particular source, it is likely that someone
> would be interested in writing scripts to run across the entire
> collection.
> 
> Don't think we should dismiss your idea of large amounts of lightly
> marked texts nor ignore Chris's suggestion that some cleanup is
> probably not that hard. It really isn't an either/or situation.
> 
> Suggestion: Can we find a place for migration to OSIS on next week's
> calendar? Perhaps an evening session? (No, I will not stay up to
> midnight like some people talked me into in Rome several years ago but
> can last until about 8:30-9:00 PM.) Thought Troy could give us
> examples of large amounts of lightly encoded texts and Chris could
> suggest regexes that would make them more robust.
> 
> Hope everyone is at the start of a great day!
> 
> Patrick
> 
> 
> >     -Troy.
> >
> >
> >
> > Todd Tillinghast wrote:
> >
> >> Troy,
> >>
> >> I think <hi> and xml:whitespace fall into two different categories.
> >> I think the discussion to date points away from the need for
> >> xml:whitespace.
> >>
> >> Todd
> >>
> >>
> >>> -----Original Message-----
> >>> From: osis-core-admin@bibletechnologieswg.org
> >>> [mailto:osis-core-admin@bibletechnologieswg.org] On Behalf Of Troy A. Griffitts
> >>> Sent: Wednesday, August 20, 2003 3:32 PM
> >>> To: osis-core@bibletechnologieswg.org
> >>> Subject: Re: [osis-core] <hi> types
> >>>
> >>> So does that mean we intend to honor the xml:whitespace="preserve"
> >>> attribute suggested by the W3C?
> >>>
> >>> Patrick Durusau wrote:
> >>>
> >>>> Harry,
> >>>>
> >>>> Harry Plantinga wrote:
> >>>>
> >>>>
> >>>>>> I am concerned that encoders would use the presentation-related
> >>>>>> elements RATHER THAN other elements.  (Ex <hi
> >>>>>> type='smallCaps'>Lord</hi> rather than <divineName
> >>>>>> type='yhwh'>Lord</divineName>, etc...)
> >>>>>>
> >>>>>> I do see a need for <hi> in non-Biblical texts.  If as Chris suggests
> >>>>>> we use <hi> to encode meaning and not presentation we will be better
> >>>>>> off. I would like to stay away from type values of bold, italics,
> >>>>>> etc... in favor of strongEmphasis, emphasis, etc...  I don't have a
> >>>>>> good suggestion for a comprehensive set of type values.
> >>>>>
> >>>>> I've seen this debate many times before and usually it is not
> >>>>> settled to everyone's satisfaction. However, it is clear that
> >>>>> there are times when italics, bold, etc. will be present in a text
> >>>>> and will not be representable in any OSIS markup apart
> >>>>> from something like <hi type="bold">.
> >>>>>
> >>>>
> >>>> Say it's not so, Harry! ;-)
> >>>>
> >>>>
> >>>>> It is also clear to me that 95% of the time encoders are going
> >>>>> to be unwilling to go through an old book and figure out
> >>>>> what each instance of italicized text means when there is
> >>>>> <hi type="italics"> available that meets 95% of people's usage
> >>>>> needs.
> >>>>>
> >>>>> That is, everyone has a threshold at which they say "I just
> >>>>> mean italics, darnit!" but if italics is an available markup
> >>>>> option, it'll be used much more than some will find desirable.
> >>>>>
> >>>>> But if there is no way of marking some text as 'italics', OSIS
> >>>>> will not be usable for quick-and-dirty conversion of
> >>>>> texts from one markup to another -- only for very laborious,
> >>>>> hand-tuned markup. If that's what you want, go for it!
> >>>>>
> >>>>
> >>>> I think Harry has the right of it, reluctantly, but I do. Getting large
> >>>> amounts of texts into some semblance of reasonable markup is difficult
> >>>> enough without insisting on practices that most encoders either aren't
> >>>> capable of following or won't. At best the material is unmarked
> >>>> altogether, at worst they don't use the markup system at all.
> >>>>
> >>>> I would go with Chris's suggestion of common names, such as italic,
> >>>> bold, etc., (yea, verily, presentation language) rather than less
> >>>> intuitive alternatives.
> >>>>
> >>>> Actually we could begin to build NLP software with knowledge bases
> >>>> of terms, names, etc., that would allow some automated upgrading of
> >>>> less complex encoding.
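
As a rough sketch of what such automated upgrading could look like at its very simplest, the Python below rewrites presentational <hi> elements into semantic ones using a tiny lookup table. The table KNOWN, the function upgrade, and the un-namespaced sample fragment are all assumptions for illustration; the one mapping shown is taken from the smallCaps/divineName example quoted earlier in this thread, and a real knowledge base would be far larger and would need context to avoid false matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical knowledge base: (hi type, text) -> (new element, new type).
KNOWN = {("smallCaps", "Lord"): ("divineName", "yhwh")}

def upgrade(root):
    """Rewrite recognized <hi> elements in place; return how many changed."""
    changed = 0
    # Materialize the list first, since tags are modified while looping.
    for hi in list(root.iter("hi")):
        key = (hi.get("type"), (hi.text or "").strip())
        # Only touch simple cases: known text and no child elements.
        if key in KNOWN and len(hi) == 0:
            hi.tag, hi.attrib["type"] = KNOWN[key]
            changed += 1
    return changed

root = ET.fromstring(
    '<p>The <hi type="smallCaps">Lord</hi> is <hi type="italic">good</hi>.</p>'
)
print(upgrade(root), ET.tostring(root, encoding="unicode"))
```

Anything the table does not recognize, such as the italic span here, is left as it was, so the quick-and-dirty encoding remains usable while the upgrade proceeds piecemeal.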
> >>>>
> >>>> Hope everyone is having a great day!
> >>>>
> >>>> Patrick
> >>>>
> >>>>
> >>>>> -Harry
> >>>>>
> >>>>> _______________________________________________
> >>>>> osis-core mailing list
> >>>>> osis-core@bibletechnologieswg.org
> >>>>> http://www.bibletechnologieswg.org/mailman/listinfo/osis-core
> >>>>>
> >>>>
> >>>>
> >>
> >>
> >>
> >
> >
> >
> 
> 
> --
> Patrick Durusau
> Director of Research and Development
> Society of Biblical Literature
> Patrick.Durusau@sbl-site.org
> Chair, V1 - Text Processing: Office and Publishing Systems Interface
> Co-Editor, ISO 13250, Topic Maps -- Reference Model
> 
> Topic Maps: Human, not artificial, intelligence at work!
> 
> 