[osis-core] whitespace

Todd Tillinghast osis-core@bibletechnologieswg.org
Fri, 8 Aug 2003 13:26:37 -0600


Troy,

I am with you on the two spaces between sentences.  I don't see this as
formatting but simply a part of the text.  Although I have not done this
myself, it would likely be better practice to use the appropriate
entities for double spaces between sentences, for spaces between the
last word and the closing </x>, and for spaces between the opening <x>
and the first word.  With this in place we won't have to use xml:space
(other than for the presentation-related purposes mentioned below).  I
(surprise) would
prefer not to see the introduction and use of xml:space.

I am not sure whether a text node is ignored if it consists entirely of
whitespace and you have used xml:space="preserve".  If text nodes that
are entirely whitespace are not ignored when xml:space="preserve" is
set, then the line breaks and tabs/spaces used to indent the XML
elements would also be preserved.  I am sure Patrick or Steve can easily
clarify this point.
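
A quick way to look at the question above is to ask a parser.  The
sketch below uses Python's standard xml.dom.minidom (a tool chosen for
illustration, not anything OSIS prescribes); the <p> and <x> element
names and the sample text are made up.  It lists the children of the
root element so the whitespace-only text nodes can be compared with and
without xml:space="preserve".  With this particular non-validating
parser the indentation is handed to the application in both cases, so
whether it is kept is a decision for the application, not the parser.

```python
from xml.dom import minidom


def child_summary(xml_text):
    """Describe each child of the root as ("text", data) or ("element", name)."""
    root = minidom.parseString(xml_text).documentElement
    summary = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            summary.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            summary.append(("element", node.tagName))
    return summary


# Hypothetical fragments: identical except for the xml:space attribute.
INDENTED_PRESERVE = '<p xml:space="preserve">\n    <x>word</x>\n</p>'
INDENTED_DEFAULT = '<p>\n    <x>word</x>\n</p>'

if __name__ == "__main__":
    print(child_summary(INDENTED_PRESERVE))
    print(child_summary(INDENTED_DEFAULT))
```

Other parsers and validating processors may report such nodes
differently, so this is only one data point.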

When it comes to line breaks, they can be clearly expressed with an
element (like <br/> in HTML).  I believe we allow <lb/> just about
everywhere we have text for this very purpose.  It is also not a
labor-intensive process to substitute an <lb/> element for every line
break character in a source text.
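
The substitution described above can be sketched in a few lines of
Python.  The function name line_breaks_to_lb is hypothetical, and the
sketch assumes the source is plain text with no existing markup, so
characters such as & and < are escaped before the <lb/> elements are
inserted.

```python
from xml.sax.saxutils import escape


def line_breaks_to_lb(text):
    """Replace each line break in plain text with an <lb/> element.

    The text is XML-escaped first so that only the inserted <lb/>
    elements are treated as markup.
    """
    lines = text.splitlines()  # handles \n, \r\n and \r
    return "<lb/>".join(escape(line) for line in lines)


if __name__ == "__main__":
    print(line_breaks_to_lb("In the beginning\nGod created"))
```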

I do think that tabs used for indentation or to represent a table are
presentation-related and should be replaced with markup.  This may not
require a significant manual encoding effort, but would likely require a
little more effort in the automated process that is generating the XML
from the source.
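
As an illustration of the extra automated step, here is a minimal
Python sketch that turns tab-separated source lines into table markup.
The function name tabs_to_table is hypothetical, and the element names
table, row and cell are an assumption about what the schema offers;
check them against the schema before relying on this.

```python
import xml.etree.ElementTree as ET


def tabs_to_table(text):
    """Convert tab-separated lines into <table>/<row>/<cell> markup."""
    table = ET.Element("table")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines rather than emit empty rows
        row = ET.SubElement(table, "row")
        for value in line.split("\t"):
            cell = ET.SubElement(row, "cell")
            cell.text = value.strip()
    return ET.tostring(table, encoding="unicode")


if __name__ == "__main__":
    print(tabs_to_table("Judah\t74600\nIssachar\t54400"))
```

Tabs used purely for indentation need a different rule (for example,
mapping them to an enclosing element), which this sketch does not
attempt.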

I agree that "suppressing whitespace is not equivalent to a 'higher
standard of quality'", but I do contend that including tabs and line
breaks in an OSIS/XML document to represent an intended presentation
strategy represents a lower standard of quality and limits the usability
of the document.

> There has to be a limit as to what you are willing to strip in respect
> to whitespace.  We use it all the time. and in HTML, they preserve it
> with: &nbsp; and <br/>.

Neither &nbsp; nor <br/> is considered whitespace.

HTML is presentation-targeted markup, so you would expect it to include
presentation- and formatting-related information.

Although I would rather see no elements that even hint at formatting, I
have been convinced (mainly by Chris' irrefutable cases) that there are
a number of cases where the formatting in a printed edition needs to be
encoded using an element.

In any case I believe that whitespace that would be stripped out should
be replaced by either an entity or an element.

> 
> At the risk of forcing everyone to pollute their documents with
> ill-chosen and irresponsible <p> tags to get the formatting they want,
> I feel we need to address the whitespace issue.

I believe <p> should be used for paragraphs.  If there are cases where
you would be forced to use a <p> element (not an empty <p/>) to
represent the text you need to encode because the OSIS schema is
inadequate, we should understand the case and adjust the schema.

If you are talking about "irresponsibly" using <p/> elements to get the
rendering application to produce a line break, I agree with you.
However, I would most likely suggest that some sort of enclosing element
be used and that the rendering process be constructed/instructed so
that the desired formatting is produced.  For example, there is no need
to put a <p/> or an <lb/> after a <title> in order to get the desired
spacing after the title.  In fact it would be a disservice to include
it.  There are some scripture renderings that start the section title at
the start of the line and then start the verse text on the same line
immediately following the title.

Similarly, it is not appropriate to put a line break between <lg>
elements in the Psalms.  That should be left up to the formatter.  One
rendering may choose to start each <lg> at the top of a page.  Another
may present an <lg> all on one line, alternating the color of the text
to indicate the different lines of poetry, or present a " * " between
lines of poetry.
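
To make the point about leaving this to the formatter concrete, here is
a small Python sketch in which the same markup is rendered two ways.
The function render_lg and its style names are hypothetical, and the
sketch assumes each line of poetry is an <l> element inside the <lg>.

```python
import xml.etree.ElementTree as ET


def render_lg(lg_xml, style="stacked"):
    """Render an <lg> either one <l> per line or inline with " * "."""
    lg = ET.fromstring(lg_xml)
    lines = ["".join(l.itertext()).strip() for l in lg.iter("l")]
    if style == "stacked":
        return "\n".join(lines)
    if style == "inline":
        return " * ".join(lines)
    raise ValueError("unknown style: " + style)


# Sample markup with no line break characters in it at all.
PSALM = "<lg><l>The LORD is my shepherd;</l><l>I shall not want.</l></lg>"

if __name__ == "__main__":
    print(render_lg(PSALM, "stacked"))
    print(render_lg(PSALM, "inline"))
```

Because the line structure lives in the elements, neither rendering
depends on whitespace in the source.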

Or, rather than indenting text, a rendering might present it in a
smaller point size or a different typeface.

If the tabs and line break characters are left in the XML text, then the
text is fixed to a single presentation style.


Are you agreeable to putting elements or entities in where you would be
tempted to leave in whitespace that would be removed by a parser, AND to
not encoding pure presentation information?

I would also contend that best practice should go a step further and say
that enclosing elements should be used when possible.  For example, a
list should be encoded using <list> rather than as a sequence of text
nodes separated by <lb/>.
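
A minimal Python sketch of that last conversion follows.  The function
name lb_sequence_to_list is hypothetical, the wrapping <p> in the usage
line is only there to give the parser a root element, and the <item>
child of <list> is an assumption to be checked against the schema.

```python
import xml.etree.ElementTree as ET


def lb_sequence_to_list(xml_fragment):
    """Rewrite text nodes separated by <lb/> as a <list> of <item>s."""
    source = ET.fromstring(xml_fragment)
    pieces = [source.text or ""]
    for child in source:
        if child.tag != "lb":
            raise ValueError("expected only <lb/> children, got <%s>" % child.tag)
        pieces.append(child.tail or "")  # text following this <lb/>
    result = ET.Element("list")
    for piece in pieces:
        if piece.strip():  # drop empty entries left by stray breaks
            ET.SubElement(result, "item").text = piece.strip()
    return ET.tostring(result, encoding="unicode")


if __name__ == "__main__":
    print(lb_sequence_to_list("<p>Reuben<lb/>Simeon<lb/>Levi</p>"))
```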

Todd