[sword-devel] XSLT vs. C++

Tue Nov 30 17:09:59 MST 2010

On Tue, Nov 30, 2010 at 1:08 PM, Troy A. Griffitts <scribe at crosswire.org> wrote:
> Having finally returned from a hectic 2 weeks of conferences, and lots
> to do before leaving for Christmas, I'm not sure I'm up for a heated,
> passionate debate about technologies right now, but by all means, please
> commence the public discussion.
>
> Let me start by saying that everyone (I believe) agrees that we would
> like to have an HTML output from the engine which is more generic and
> would allow CSS to be applied if a frontend would like to do this.
> Currently HTMLHREF output from the engine is used by the widest number
> of frontends (to my knowledge) and would benefit everyone involved by
> becoming much more generic. e.g.,
>
> <title> -> <h1>
> rather than
> <title> -> <b><br />
>
> <transChange type="added"> -> <span class="tcAdded">
> rather than
> <transChange type="added"> -> <i>
>
> etc.
>
> I believe this will solve a number of issues and possibly get the BT and
> MacSword teams onboard to using the same HTML output filters as the
> other projects involve (or at least subclassing them and using the
> majority of their functionality).

I think this is our pretty well accepted premise.  The current filters
stink to various degrees and currently no one is willing to step up
and tackle them.

>
>
> Now, as to the other issue of using XSLT internally in the engine to
> process OSIS -> HTML
>
> I will throw a few melons into the air for target practice, and let the
> shooting commence.
>
> _____________________________
> *Multiple Language*
>
> XSLT is a programming language in the same sense that C++ is a
> programming language.
>
> The SWORD Project C++ engine is written in C++.  It is not a Python
> engine; it is not a Perl engine; it is not a Java engine; it is C++.
>
> One might say, "Well, you can use XSLT from C++.  Doesn't JSword do this
> from Java?"  Well, yes, of course you can, and DM can comment, if he
> feels the desire to recommend his decision to incorporate an XSLT engine
> into the JSword logic flow.  But simply because one CAN doesn't mean one
> SHOULD.  We COULD incorporate a Perl text processing engine in our C++
> code, or an Awk processing engine...  that doesn't mean we SHOULD.  I'm
> sure some would say we SHOULD.  And obviously DM has thought he SHOULD
> incorporate XSLT processing for JSword, so I'm not intending to say it
> is a BAD decision, just that it is not a decision I would make; in the
> same way as our projects each chose C++ vs. Java to implement our objective.

If a developer is going to develop OSIS -> HTML filters, for instance,
we are already assuming they know OSIS and HTML.  OSIS is XML and HTML
is SGML (though most of our work is probably targeting an XML dialect
of HTML, i.e. XHTML).  XSLT is also XML.  Formally, it is not even a
programming language, but just a set of formatting/processing
instructions in XML.

Any developer using XML who is worth their salt should at least be
familiar with the basics of XSL - they may not be a guru of XPath
expressions or have every attribute of XSL memorized - and would
probably expect a library which handles XML as its preferred input
method to utilize one of the standard XML processing methods.  I know
I'm not the only person who was surprised to look in the library
filters and see neither DOM, SAX nor XSLT technologies in use.  That
was when I first ran and hid.

Of course, this portion of the discussion is only relevant for the
from-OSIS filters.

>
> _______________________
> *XSLT better than C++*
>
> One might say, "well, XSLT is better suited to process XML than C++."
> That's a loaded and unquantified statement.
>
> Certainly the C++ language specification doesn't include facilities to
> easily process XML, but that doesn't mean a plethora of C++ libraries
> don't exist for assisting in this task.
>
> The SWORD engine includes classes like XMLTag and SWBasicFilter which
> implement a SAX processing model.
>
> The current filters do not all use SWBasicFilter, nor XMLTag.  They've
> been written over 15 years and many before these classes existed.  Some
> are ugly and need to be rewritten for readability, certainly.  But not
> necessarily in a different programming language.

XSLT being "better" is, yes, a matter of complete subjectivity.  And,
as I mentioned above, is only useful when our source is XML to begin
with.  For GBF or Plaintext sources, XSLT is clearly not even
applicable.

But the current C++ is so good that you seem to be the only person willing
to touch it.  Peter just mentioned he tried once and couldn't get it.
I have gone into the filters before with a singular goal in mind and
was able to produce my desired changes, but it was long, drawn-out and
painful.  Doing the same tasks in XSL would have taken me mere
seconds.  I know a few other people, at least, have said they would
know how to do a task if XSLT were used instead of C++.  Of course,
that is a hypothetical - I can't know that they would have done so,
but that was their claim at the time.

Our recent discussion about the use of the "n" attribute for footnotes
in ThML is a perfect example.  Maintaining the attribute in XSL would
have been a trivial task I could have handled in seconds.  Instead, it
required you, me and Karl and took about 10 days to get fixed.
You had to alert Karl and me to the presence of the attributes, I provided
him a preliminary patch to incorporate the values, then he had to
heavily modify the patch to operate correctly in non-ThML source and a
few other corner cases.  And, in the end, the fix is only in Xiphos'
code base - I would have to go through two of those three steps again in
Bibletime, BPBible, MacSword and any other applications I wanted to
see proper behavior in.  Alternatively I could tackle the filters -
but I'm not really inclined to do so.

Is XSLT "better"?  For me, it would be better because I could more
easily modify its behavior based on the fact that I know XML and could
easily locate the necessary processing directive.  For you, maybe not.
Are there things you simply cannot do in XSL that you can in C++? Yes.  IMO
the benefits of XSL outweigh the benefits of C++ for this task, but
you clearly disagree. :)  I would also say that DOM or SAX processing
would be better for all the same reasons - it shields the user from
having to see the XML parsing and handle inconsistencies in
whitespace, validation, etc and is still a decently well-known
technology among XML users (even if it's slightly less well-known than
XSL).  And with a DOM or SAX parser, you could still happily employ
the full power of C++.

>
> ________________________
> *COMPLEXITY*
>
> The task of enumerating all types of OSIS <title> tags, and deciding
> what to do with each, and how to classify all <title> tags from all
> possible OSIS documents into our enumeration is still going to be a
> complex task using XSLT.  <title> is a complex example, but certainly
> not the most complex.
>
> It is a tall task to generalize all elements of all documents from all
> publishers into one conceptual model with one chosen output for a
> frontend-- whether that be for an audience on the Desktop, web-based, or
> a handheld.
>
> The complex processing required by the engine will require long, complex
> XSLT-- which likely will incorporate callbacks to C++.  It will not be
> more simple-- only mixed language.

I could also argue that XSL would not require a developer to mentally
separate the code that merely locates and parses XML elements and
attributes from the code that transforms them and generates the
output.  So yes, it might include some extension functions written in
C++, but it would still be simpler.  And it would also be more
expressive.

The enumeration of every OSIS <title> tag is a moot point for the
decision.  You need to enumerate them all in C++ as well and decide
what to do with them.  That doesn't change in the XSL - just the
method used.  An XSL match along the lines of <xsl:template
match="title[@type='psalm']"> still has to be done in C++ with some sort
of if (tag.name() == "title" && tag.attr("type") == "psalm") or whatever
the syntax is.  And that is assuming the current filter is using
XMLTag and isn't comparing character strings directly.
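
To make that comparison concrete, here is a rough, self-contained C++
sketch of the same dispatch.  The Tag struct and renderTitleOpen()
are hypothetical stand-ins invented for illustration; they are not
SWORD's XMLTag API, and the "psalm" output markup is made up (only
<title> -> <h1> comes from Troy's example above).

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a parsed start tag; NOT SWORD's XMLTag.
struct Tag {
    std::string name;
    std::map<std::string, std::string> attrs;

    // Return the attribute value, or an empty string if it is absent.
    std::string attr(const std::string &key) const {
        auto it = attrs.find(key);
        return it == attrs.end() ? std::string() : it->second;
    }
};

// Rough equivalent of <xsl:template match="title[@type='psalm']"> plus a
// fallback template for every other <title>.
std::string renderTitleOpen(const Tag &tag) {
    assert(tag.name == "title");  // caller has already dispatched on the name
    if (tag.attr("type") == "psalm") {
        return "<h3 class=\"psalm\">";  // invented output, for illustration only
    }
    return "<h1>";
}
```

Either way the enumeration of cases is the same size; what differs is
whether the match condition lives in an XPath expression or in an if.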

> _______________________
> *Semantic vs. Display*
>
> Some will say (and have), "well, let everything be display oriented and
> let the publisher decide".  Fine, then you lose 2 things: the ability to
> display differently per user preference, per display device; and you
> also give up the promise to actually do any interesting research on the
> text.  When you lose semantic markup, then you lose all interesting
> information about WHAT is being marked up.

I just want to be clear that I'm not advocating the use of display
over semantics as a general choice.  My statements are strictly based
around my specific task and the fact that OSIS support in SWORD and
the front ends is not as good as the support of ThML.  Largely this is
because most applications display in HTML and my required task is
framed entirely in terms of the presentation and display - not the
semantics.  I would love and prefer to use OSIS for this task, but I
simply cannot accomplish it with the state of SWORD at this time.

>
> _______________________
> *More than a Rendering Engine*
>
> The SWORD C++ Engine is more than simply a text rendering engine-- it is
> a Biblical text research engine.
>
> If I'd like to know the morphology of word 3 in 2Thes 2.13 of the WHNU
> Greek text, the entire program to do such is:
>
> SWMgr library;
> SWModule *whnu = library.getModule("WHNU");
> whnu->setKey("2th.2.13");
> whnu->RenderText();
>
> cout << "The morphology of word three is: " <<
> whnu->getEntryAttributes()["Word"]["003"]["Morph"] << endl;
>
>
> That reads nice (at least in my opinion).  I don't need to know about
> XML, XSLT, care what markup the WHNU module uses, I don't even have to
> know how to make a SWORD filter.  The current filters do all the work of
> breaking out these attributes and making them available in a nice and
> interesting map.

I'd like to be clear again, that XSL would only be useful for material
already in OSIS formats (or in valid ThML - I think TEI is also an XML
format?).  I doubt many modules in ThML are strictly valid at their
import times, so XSL wouldn't be very useful, and GBF is a monster
unto itself.  Doing the above in XSL from an OSIS source would not be
much different in complexity than what you have listed there.

<xsl:template match="verse[@osisID='2thes.2.13']/w[@n=3]">
The morphology of word three is: <xsl:value-of select="@morph" />
</xsl:template>

Or something similar (my knowledge of exact OSIS attribute names and
values wanes and it's been two or three weeks since I wrote an XPath
expression).

Of course, the string processing portion of SWORD would continue to be
of great importance for any modules in GBF format or similar to bring
them into a useful form.  In that way, SWORD would continue to be more
than just a text rendering engine.  It would continue to offer all of
its features, its buffering from the system and from the format, its
indexing, its module fetching and storing, etc.

> ______________________
>
>
> And finally, if bullets aren't flying already, I'll stir the heat up with...
>
> XSLT sucks.  A good C++ programmer can do anything in C++ better than
> any XSLT programmer.
>
>
> :)

A C++ programmer can definitely do more, since C++ is actually a
programming language and XSLT is a set of processing instructions.
Better?  That depends on what the criteria are.  For me, in my current
role as a module creator, the use of C++ is not currently better
because it is less flexible and extensible.  For you, as the library
maintainer, perhaps C++ is better because it's what you are already
comfortable with and because it has largely been your hand in the
filters.

>
> *duck*
> Have fun.
>
> Troy
>
> PS.  In summary, I understand the current filters are sometimes overly
> complex and need cleanup, standardization, etc.  It comes down to the
> fact that they mostly work, and other things which don't get priority,
> so they don't get much attention.  But honestly, I think one might be
> oversimplifying the problem at hand without realizing it, if one simply
> thinks switching to XSLT will make things easier.

I think one is also oversimplifying the options.  My dreamlist is that
SWORD produce a well-formed, valid, complete OSIS document for an
arbitrary KeyList that I pass it with FMT_OSIS set.  That basically
boils down to getting the *OSIS filters up to snuff and standardized.
The second item on the list is a readily extensible mechanism for
SWORD outputting HTML from that OSIS.  Whether that choice is providing an
XSL stylesheet with the library, a C++ SAX processor that a front-end
can readily extend, or a DOM interface that can be easily customized is
immaterial to me.  I like all three of those, and can easily
understand and extend all of them.
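
As a rough idea of what the "C++ SAX processor that a front-end can
readily extend" option might look like, here is a minimal sketch.
Every name in it (OsisHtmlHandler, CompactTitleHandler, renderTitle)
is hypothetical and does not exist in SWORD today; the default
mappings are the ones from Troy's examples, and the override's markup
is invented.

```cpp
#include <map>
#include <string>

using Attrs = std::map<std::string, std::string>;

// Hypothetical base handler the library could ship: generic, CSS-friendly HTML.
class OsisHtmlHandler {
public:
    virtual ~OsisHtmlHandler() = default;

    virtual std::string startElement(const std::string &name, const Attrs &attrs) {
        if (name == "title") return "<h1>";
        if (name == "transChange") {
            auto it = attrs.find("type");
            if (it != attrs.end() && it->second == "added")
                return "<span class=\"tcAdded\">";
            return "<span>";
        }
        return "";  // unknown elements produce no markup in this sketch
    }

    virtual std::string endElement(const std::string &name) {
        if (name == "title") return "</h1>";
        if (name == "transChange") return "</span>";
        return "";
    }
};

// A front end overrides only what it cares about and inherits the rest.
class CompactTitleHandler : public OsisHtmlHandler {
public:
    std::string startElement(const std::string &name, const Attrs &attrs) override {
        if (name == "title") return "<h2 class=\"title\">";
        return OsisHtmlHandler::startElement(name, attrs);
    }

    std::string endElement(const std::string &name) override {
        if (name == "title") return "</h2>";
        return OsisHtmlHandler::endElement(name);
    }
};

// How the engine side might drive whichever handler it is given.
std::string renderTitle(OsisHtmlHandler &handler, const std::string &text) {
    return handler.startElement("title", Attrs()) + text + handler.endElement("title");
}
```

The parsing itself would be left to an existing SAX library; the
front end only ever sees element names and attributes.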

I think any of those technologies would be an improvement over all
in-house C++ for the second half of any such processing.  If we are
using XML in Open Source Software, let's leverage the work of others
who have happily given us permission to use their libraries!

--Greg