[sword-devel] HELP! Need your feedback on XML Markup Language

Patrick Durusau sword-devel@crosswire.org
Sat, 18 Aug 2001 05:40:39 -0400


Troy,

"Troy A. Griffitts" wrote:
> 
> You're such a smart alec! :)
> 

Sorry, could not resist! ;-) It has been a long week and I needed some
humor to lighten my day!

> Liked the paper, though it stretched my knowledge of XPointer and XSLT.
> 
> Might I suggest, rather than force the granularity to the smallest
> PCDATA in all associative meta hierarchies, that you designate one
> document the 'master'; let it raise its hierarchy from a flat 'word' to
> something that the other ancillary hierarchies _might_ have in
> common (e.g. module/testament/chapter/verse/word, or anything it
> wishes). Force key attributes to be unique for all levels (like you have
> done for 'word') in the master. This will allow a greatly reduced size
> and complexity of additional auxiliary hierarchies, and remove the
> redundant CDATA from all documents.
> 

Using a master document is one way to implement the analysis, and for
some purposes it might be a more "efficient" one or serve other needs.
Both Matt and I are interested in situations where there is no "master"
text but rather the ability to query any "text" against any other. This
is similar to building a critical apparatus in TEI without declaring a
"base text" (in the TEI critical apparatus sense).
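
To make the "no master" case concrete, here is a minimal Python sketch,
assuming each hierarchy is a separate document that tags the same words
with the same ids, as in the Pages and Clauses examples quoted below.
The function names (membership, sharing) and the choice to label an
element by its id, or by its tag name when it has no id, are
hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET

PAGES = """
<pages>
  <page id="p1">
    <line id="l1"><w id="w1">This</w><w id="w2">is</w></line>
    <line id="l2"><w id="w3">text</w></line>
  </page>
  <page id="p2">
    <line id="l3"><w id="w4">in</w><w id="w5">a</w><w id="w6">base</w></line>
    <line id="l4"><w id="w7">file</w></line>
  </page>
</pages>
"""

CLAUSES = """
<clauses>
  <clause id="c1">
    <s><w id="w1">This</w></s>
    <p><w id="w2">is</w></p>
    <c><w id="w3">text</w></c>
    <a><w id="w4">in</w><w id="w5">a</w><w id="w6">base</w><w id="w7">file</w></a>
  </clause>
</clauses>
"""


def membership(xml_text):
    """Map each non-word element to the set of word ids it contains.

    Elements are labelled by id, falling back to the tag name
    (an assumption made only for this sketch).
    """
    table = {}
    for el in ET.fromstring(xml_text).iter():
        if el.tag == "w":
            continue
        label = el.get("id") or el.tag
        table[label] = {w.get("id") for w in el.iter("w")}
    return table


def sharing(source, label, target):
    """List the elements of `target` that share a word with source[label]."""
    words = source[label]
    return sorted(name for name, ids in target.items() if ids & words)


pages = membership(PAGES)
clauses = membership(CLAUSES)

# Neither document is privileged: the same query runs in both directions.
print(sharing(clauses, "a", pages))    # ['l3', 'l4', 'p2', 'pages']
print(sharing(pages, "l1", clauses))   # ['c1', 'clauses', 'p', 's']
```

Because membership is recorded per document as plain sets of word ids,
any hierarchy can be queried against any other without one of them
having to carry the structure for the rest.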

> To use your example:
> 
> Dub your mostly unchanged Pages document the 'master' (just for example
> purposes; any file could be dubbed the 'master', but it looks like we
> get the most benefit from this first choice).  I've added unique
> attributes-- per our 'master' document requirements, above-- throughout
> the document (l1[2]=l3, and l2[2]=l4):
> 

Be careful that the choice does not preclude other choices for the
"master" document. That may be a good choice for some purposes but as I
noted above, would be less useful in other situations.

One problem with the example (although I can see that it works for our
example) is the loss of information. The "membership" information that
we are recording is implied in conventional markup and I am not sure
that your example does not recapture some of that implication of
membership rather than recording it expressly. That is not necessarily a
bad thing; I can imagine works that use our technique for only portions
of a text.

> <pages>
>      <page id="p1">
>            <line id="l1">
>               <w id="w1">This</w>
>               <w id="w2">is</w>
>            </line>
>            <line id="l2">
>               <w id="w3">text</w>
>            </line>
>      </page>
>      <page id="p2">
>            <line id="l3">
>               <w id="w4">in</w>
>               <w id="w5">a</w>
>               <w id="w6">base</w>
>            </line>
>            <line id="l4">
>               <w id="w7">file</w>
>            </line>
>      </page>
> </pages>
> 
> 
> This allows your Text document to be reduced from:
> 
> <text>
>      <para id="p1">
>               <w id="w1">This</w>
>               <w id="w2">is</w>
>               <w id="w3">text</w>
>               <w id="w4">in</w>
>               <w id="w5">a</w>
>               <w id="w6">base</w>
>               <w id="w7">file</w>
>      </para>
> </text>
> 
> to:
> 
> <text>
>      <para id="p1">
>            <page id="p1" />
>            <page id="p2" />
>      </para>
> </text>
> 
> 
> Clauses from:
> 
> <clauses>
>         <clause id="c1">
>              <s>
>                  <w id="w1">This</w>
>              </s>
>              <p>
>                  <w id="w2">is</w>
>              </p>
>              <c>
>                  <w id="w3">text</w>
>              </c>
>              <a>
>                  <w id="w4">in</w>
>                  <w id="w5">a</w>
>                  <w id="w6">base</w>
>                  <w id="w7">file</w>
>              </a>
>         </clause>
> </clauses>
> 
> to:
> 
> <clauses>
>         <clause id="c1">
>              <s>
>                  <w id="w1" />
>              </s>
>              <p>
>                  <w id="w2" />
>              </p>
>              <c>
>                  <w id="w3" />
>              </c>
>              <a>
>                  <page id="p2" />
>              </a>
>         </clause>
> </clauses>
> 
> The saving in space we see here is minimal, but I believe it reduces
> error-prone redundancy and provides a mechanism to potentially save
> exponentially on space.
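
The reduction quoted above depends on resolving empty elements against
the master document by id. A minimal Python sketch of that lookup
follows, assuming the auxiliary documents are fully reduced as in the
quoted examples. The helper names (index_master, is_reference,
words_under) are hypothetical, and so is the rule that an empty element
counts as a reference only when both its tag and its id match an
element in the master; neither comes from the paper or from the quoted
message.

```python
import xml.etree.ElementTree as ET

MASTER = """
<pages>
  <page id="p1">
    <line id="l1"><w id="w1">This</w><w id="w2">is</w></line>
    <line id="l2"><w id="w3">text</w></line>
  </page>
  <page id="p2">
    <line id="l3"><w id="w4">in</w><w id="w5">a</w><w id="w6">base</w></line>
    <line id="l4"><w id="w7">file</w></line>
  </page>
</pages>
"""

TEXT = """
<text>
  <para id="p1">
    <page id="p1" />
    <page id="p2" />
  </para>
</text>
"""

CLAUSES = """
<clauses>
  <clause id="c1">
    <s><w id="w1" /></s>
    <p><w id="w2" /></p>
    <c><w id="w3" /></c>
    <a><page id="p2" /></a>
  </clause>
</clauses>
"""


def index_master(master_root):
    """Map every id in the master to its element (ids unique at all levels)."""
    return {el.get("id"): el for el in master_root.iter() if el.get("id")}


def is_reference(el, master_index):
    """True for an empty element whose tag and id both match a master element.

    The tag check matters: <para id="p1"> shares its id with <page id="p1">
    in the master but is a container, not a reference.
    """
    target = master_index.get(el.get("id"))
    empty = len(el) == 0 and not (el.text or "").strip()
    return target is not None and empty and target.tag == el.tag


def words_under(element, master_index):
    """Return (id, text) pairs for the words an auxiliary element covers."""
    words = []
    for child in element:
        if is_reference(child, master_index):
            target = master_index[child.get("id")]
            # iter("w") also yields the target itself when it is a <w>.
            words.extend((w.get("id"), w.text) for w in target.iter("w"))
        else:
            words.extend(words_under(child, master_index))
    return words


index = index_master(ET.fromstring(MASTER))
for doc in (TEXT, CLAUSES):
    expanded = words_under(ET.fromstring(doc), index)
    print(" ".join(text for _, text in expanded))
```

Both reduced documents expand back to the same seven words, in order,
which is the property the reduction has to preserve.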

We have discussed the use of entities to collapse common hierarchies for
processing. I am not sure how the Xalan DTM model handles redundant
information, but we are planning (hopefully someone else beats us to it!)
to investigate the use of that model for this information.

Glad you liked the paper!

Patrick

> 
> Please ignore me if I may be way off base.  Just my 1/2 cent worth.
> 
>         -Troy.

-- 
Patrick Durusau
Director of Research and Development
Society of Biblical Literature
pdurusau@emory.edu