[sword-devel] Valid vs Best Practice XML

Chris Little chrislit at crosswire.org
Sun Sep 16 16:53:49 MST 2012


On 09/15/2012 10:26 AM, DM Smith wrote:
>
> On Sep 14, 2012, at 8:15 PM, Chris Little <chrislit at crosswire.org>
> wrote:
>
>>
>>
>> On 09/14/2012 01:02 PM, Greg Hellings wrote:
>>> So I've been debugging a module display problem in BibleTime. I
>>> mentioned it on IRC with Troy the other day but we weren't able
>>> to connect at the same time to discuss further. The issue has to
>>> do with paragraph tags - in osis2mod these tags are being
>>> converted from <p> to <div sID="someid" type="paragraph" />.
>>
>> This is extraordinarily bad. This is a change in semantics, because
>> <p> and <div type="paragraph"> are not semantically equivalent.
>
>
>>
>> <p> marks the type of paragraph we all probably think of first:
>> generally, a chunk of text with newlines before and after.
>>
>> <div type="paragraph"> marks a formal division within a text that
>> happens to be identified as a 'paragraph' and may consist of
>> multiple <p>-type paragraphs. Examples of these divisions are found
>> in many laws and the Catechism of the Catholic Church (which does
>> exist in OSIS form). Here's part 1, section 1, chapter 1, article
>> 1, paragraph 1 of the CCC:
>> http://www.vatican.va/archive/ENG0015/__P16.HTM. As you can see, it
>> consists of many <p>-type paragraphs but is a single <div
>> type="paragraph">-type paragraph.
>
> Nowhere in the OSIS manual does it give any indication of a semantic
> difference.

The manual is, of course, not exhaustive. It doesn't actually say 
anything about <div type="paragraph">, and notably doesn't suggest that 
there is any alternative to <p> within the section on paragraphs.

Correct me if I'm wrong, but I don't believe there is any case anywhere 
within the OSIS spec in which two distinct methods of marking a structure 
are semantically identical. So all of the following are semantically 
distinct:
<chapter> vs. <div type="chapter">
<p> vs. <div type="paragraph">
<l> vs. <lb/> vs. <milestone type="line">
<closer> vs. <div type="colophon">

It's possible there was some corner case that necessitated allowing two 
forms of markup for a single type of semantic structure, but I can't 
think of one and would hope there was a really good reason for allowing it.
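
To make the <p> vs. <div type="paragraph"> distinction concrete, here is a 
minimal sketch in Python. The fragment is a made-up, simplified example 
(not taken from the CCC module or any real OSIS document, and without the 
OSIS namespace), and paragraph_counts is a hypothetical helper name. It 
only shows that one formal "paragraph" division can hold several <p> 
elements, so the two cannot be swapped one-for-one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified OSIS-like fragment: one formal "paragraph"
# division that contains three ordinary <p> paragraphs.
SAMPLE = """
<div type="section">
  <div type="paragraph" n="1">
    <p>First block of text.</p>
    <p>Second block of text.</p>
    <p>Third block of text.</p>
  </div>
</div>
"""


def paragraph_counts(xml_text):
    """Return (number of <div type="paragraph">, number of <p>)."""
    root = ET.fromstring(xml_text)
    formal = root.findall(".//div[@type='paragraph']")
    plain = root.findall(".//p")
    return len(formal), len(plain)


print(paragraph_counts(SAMPLE))
```

A converter that rewrites every <p> as <div type="paragraph"> would turn 
this into four "paragraph" divisions, one nested around three others, and 
the original structure could no longer be recovered.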

The inclusion of <div type="paragraph"> in OSIS is quite possibly to be 
attributed to me since the Catechism of the Catholic Church was an early 
OSIS demo document I produced for ABS and presented at a conference at 
the University of San Francisco. It's still apparent to me that the 
value is necessary, in spite of the potentially confusing name.

>> Abhorrent though I consider milestoned <p/>, I think I would much
>> prefer to see us map <p>...</p> to <p sID=""/>...<p eID=""/> than
>> see us clobber the semantics of a defined <div> type.
>
> It may be abhorrent from a module authoring perspective, but from a
> software perspective, it is needed. I think it is better than <div
> type="x-p" ...>.

Agreed.
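
For illustration only, the <p>...</p> to <p sID=""/>...<p eID=""/> mapping 
could be sketched as below. This is not what osis2mod does; the function 
name, the "gen" ID prefix, and the regex approach are all assumptions of 
mine, and the sketch only handles attribute-less <p> tags. The output is 
also not schema-valid OSIS, which is fine for internal processing but not 
for distribution.

```python
import re

# Matches only bare <p> and </p>; <p> tags carrying attributes are left
# untouched in this sketch.
_P_TAG = re.compile(r"<(/?)p\s*>")


def milestone_paragraphs(fragment, prefix="gen"):
    """Rewrite container <p> tags as paired sID/eID milestones."""
    counter = 0
    open_ids = []  # stack of IDs for paragraphs not yet closed

    def swap(match):
        nonlocal counter
        if match.group(1):  # a closing </p>
            return f'<p eID="{open_ids.pop()}"/>'
        counter += 1
        ident = f"{prefix}{counter}"
        open_ids.append(ident)
        return f'<p sID="{ident}"/>'

    return _P_TAG.sub(swap, fragment)


print(milestone_paragraphs("<p>In the beginning</p><p>And God said</p>"))
```

Because each start milestone carries the same ID as its end milestone, a 
renderer working at chapter scope could still pair them up again, while a 
verse cut out in isolation stays well-formed.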

> In OSIS the only container element that is not milestoneable is <p>.
> The goal of osis2mod is to create BCV where verse is the container.
>
> All SWORD/JSword software requires that a verse in isolation can be
> meaningfully rendered. (for hit lists, verse lists, parallel view,
> cross-reference popups, ...)
>
> If we had a mode flag for SWORD and JSword that would indicate the
> scope (chapter or verse), then the render filter could do BSP for
> chapter and BCV for verse.
>
> I would rather see milestoned <p> too. However, it seems that the
> spec is not being maintained/updated. We have a page in the wiki with
> our recommendations for changes to the OSIS spec. How can we move
> them forward?
>
> I'd suggest that we maintain our own OSIS schema with the changes and
> fixes mentioned there and use that in our module validation.

To be clear on my perspective, I don't think milestoned <p> should 
become valid OSIS. I don't mind us violating the schema internally, but 
our needs for processing data don't necessitate that milestoned <p> be 
allowed in any OSIS document anywhere. But then again, I still believe 
the milestoneability of <div> is a travesty.

It's maybe time to nudge all the OSIS principals again, to see if we can 
get things rolling. Failing that, I would recommend that we pick up 
the standard and fork it. There are bugs in the schema. There are 
various bits of USFM that have become standardized & need to be mirrored 
in OSIS to complete mapping. And obviously, there is a collection of 
reasonable improvements that have been suggested. If no one else will 
maintain the standard, we may as well.

>> I would agree that the filter output is buggy if we're generating
>> disallowed tag forms. OSIS <div> and <p> would need to be
>> translated to their correct, non-self-closing HTML forms. Beyond
>> those two, I can't think of any tags that have the same form &
>> general semantics in both OSIS & HTML.
>
> Table cells and list items are similar between OSIS and HTML:
> container elements that generally imply vertical whitespace.

<table> is another element that OSIS & HTML have in common, but I think 
that's the only common element pertaining to tables or lists. All the 
other elements for table & lists at least have different element names 
(e.g. OSIS <list> vs. HTML <ol>/<ul>).

The problem with exactly matching element names (<div>, <p>, & <table>) 
is that filter-writers are liable to be lazy and forget that they can't 
just ignore the difference in attributes. Rather, they're likely to pass 
such elements through the filter, leading to invalid attributes or 
invalid, self-closing elements.
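
As a rough sketch of the alternative (not the actual SWORD filter code), a 
filter can rebuild each element explicitly instead of passing it through. 
The HTML_NAME table, the class-name scheme, and the <span> fallback are 
assumptions made up for this example, and milestoned (sID/eID) forms are 
not handled.

```python
from html import escape
import xml.etree.ElementTree as ET

# Hypothetical mapping for the three names OSIS and HTML share; anything
# else falls back to <span> in this sketch.
HTML_NAME = {"div": "div", "p": "p", "table": "table"}


def to_html(elem):
    """Serialize an element as HTML, dropping all OSIS attributes."""
    name = HTML_NAME.get(elem.tag, "span")
    # Keep the OSIS type only as a CSS class; osisID, sID, eID etc. are
    # never copied into the HTML output.
    css = elem.tag
    if "type" in elem.attrib:
        css = f"{elem.tag}-{elem.attrib['type']}"
    parts = [f'<{name} class="{escape(css)}">', escape(elem.text or "")]
    for child in elem:
        parts.append(to_html(child))
        parts.append(escape(child.tail or ""))
    # Always emit an explicit end tag, even for empty elements.
    parts.append(f"</{name}>")
    return "".join(parts)


source = ET.fromstring('<div type="paragraph" osisID="x"><p/></div>')
print(to_html(source))
```

Writing the start and end tags by hand means an empty OSIS <p/> comes out 
as <p ...></p> rather than a self-closing form, and no OSIS-only 
attribute can leak through by accident.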

--Chris


