[sword-devel] XSLT vs. C++
dmsmith at crosswire.org
Wed Dec 1 11:08:56 MST 2010
This is not so much a response to Troy's comment about Plato's Form as a
description of the model that JSword uses. It is meant for illumination.
JSword converts ThML, GBF, PlainText and OSIS on a verse by verse basis
into well-defined fragments of XML. These fragments use the tags of
OSIS, but might not produce a valid fragment. For ease of explanation,
we say that it is converted into OSIS. If for some reason a verse in
ThML or OSIS is not well-formed, it is hacked by successively stripping
out xml parts until it parses or until only the text remains. This hack
is rather unfortunate and should be removed or improved. E.g. notes and
xrefs should never be inlined as plain text if they are marked up properly.
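To make the fallback concrete, here is a minimal C++ sketch of the idea. It is not JSword's code (which is Java and strips in several successive steps); it collapses the process into one check and one fallback. The names tagsBalanced, stripTags and repairVerse are hypothetical, and the check is a toy that only balances element tags.

```cpp
#include <string>
#include <vector>

// Toy well-formedness check: only verifies that element tags are balanced.
// It ignores comments, CDATA, entities and '>' inside attribute values.
bool tagsBalanced(const std::string &xml) {
    std::vector<std::string> open;
    std::size_t i = 0;
    while ((i = xml.find('<', i)) != std::string::npos) {
        std::size_t end = xml.find('>', i);
        if (end == std::string::npos) return false;   // unterminated tag
        std::string tag = xml.substr(i + 1, end - i - 1);
        i = end + 1;
        if (tag.empty()) return false;
        if (tag.back() == '/') continue;              // empty element, e.g. <milestone/>
        bool closing = (tag[0] == '/');
        std::string name = tag.substr(closing ? 1 : 0);
        name = name.substr(0, name.find_first_of(" \t\n"));
        if (!closing) { open.push_back(name); continue; }
        if (open.empty() || open.back() != name) return false;
        open.pop_back();
    }
    return open.empty();
}

// Last resort: drop everything between '<' and '>' and keep only the text.
std::string stripTags(const std::string &xml) {
    std::string text;
    bool inTag = false;
    for (char c : xml) {
        if (c == '<') inTag = true;
        else if (c == '>') inTag = false;
        else if (!inTag) text += c;
    }
    return text;
}

// Keep the verse as-is if it looks well-formed, otherwise fall back to text.
std::string repairVerse(const std::string &verse) {
    return tagsBalanced(verse) ? verse : stripTags(verse);
}
```

The weakness described above is visible here: once the fallback fires, the content of a properly marked-up note or xref ends up inlined as plain text.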
Though it can, JSword does not use XSLT on a verse by verse basis to
render a verse. Rather it gathers all the verses as XML fragments into
an XML document. Typically this is a chapter of verses, but it might
also be the set of verses returned from a search result, specified by
the user, or given as a cross-reference. JSword will also collect verses
from several modules into the document for parallel display.
It is this document that is rendered. How this document is rendered is
up to the application. It could use SAX. It could walk the DOM. But
Bible Desktop uses XSLT and many other JSword front-ends do so as well.
In answer to an earlier question, the XSLT is read once and reused for
all rendering of modules. It is way too expensive to do this frequently.
Once per run or only when the underlying file changes is sufficient.
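A read-once, reload-on-change policy like this can be sketched in C++ as follows. This is an illustration only, not JSword or Bible Desktop code: StylesheetCache and its Loader are hypothetical, a std::string stands in for whatever compiled-stylesheet object a real XSLT library would return, and the caller is assumed to supply the file's modification stamp.

```cpp
#include <functional>
#include <string>
#include <utility>

class StylesheetCache {
public:
    // The loader does the expensive work (read and compile the stylesheet).
    using Loader = std::function<std::string(const std::string &)>;

    explicit StylesheetCache(Loader loader) : loader_(std::move(loader)) {}

    // Returns the cached result, reloading only when the path or the
    // caller-supplied modification stamp differs from the previous call.
    const std::string &get(const std::string &path, long modifiedAt) {
        if (!loaded_ || path != path_ || modifiedAt != modifiedAt_) {
            compiled_ = loader_(path);
            path_ = path;
            modifiedAt_ = modifiedAt;
            loaded_ = true;
        }
        return compiled_;
    }

private:
    Loader loader_;
    std::string compiled_;
    std::string path_;
    long modifiedAt_ = 0;
    bool loaded_ = false;
};
```

Every render then calls get(); the loader runs only on the first call and after the stamp changes.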
One thing JSword dictates to a processor of the document: all
rendering/filtering happens within it. The BD style sheet is
parametrized for each render option. Using these parameters it shows/hides
notes, xrefs, strongs, and morph; does verse per line; changes the
representation of the verse number; and so forth.
There are several values in rendering a chapter as a whole. There are
many constructs that can include more than one verse. One can start a
tag in the middle of one verse and close it in another. If one only
rendered verse-by-verse the start and end might not be matched up
correctly. For example, SWORD's osishtmlhref filter has a quote stack
and a highlight stack. If a quote starts in one verse and ends in
another, the stack is reset going from one verse to another. So the
quote marks might not match up. (Note: osis2mod is aware of this
shortcoming and adjusts for it. However, if the module maker uses
imp2mod or vpl2mod it can happen). For the <hi> tag when an opening tag
is found, it is pushed on a stack (allowing for nesting). When an end
tag is found, the stack is consulted to see what the start tag
was. If it was bold then it closes bold, otherwise it closes italics.
However, if the stack is empty, it closes italics.
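The <hi> behaviour described above can be sketched in a few lines of C++. This is a simplified stand-in, not the actual osishtmlhref code: HiRenderer and its method names are hypothetical, and only bold and italics are modelled.

```cpp
#include <stack>
#include <string>

// Hypothetical stand-in for a per-verse <hi> handler; not SWORD's actual code.
class HiRenderer {
public:
    // Called for <hi type="...">: remember the type so the end tag can match it.
    std::string open(const std::string &type) {
        bool bold = (type == "bold");
        stack_.push(bold);
        return bold ? "<b>" : "<i>";
    }

    // Called for </hi>: consult the stack; an empty stack falls back to italics.
    std::string close() {
        if (stack_.empty()) return "</i>";
        bool bold = stack_.top();
        stack_.pop();
        return bold ? "</b>" : "</i>";
    }

    // Called at each verse boundary; this is what loses a spanning <hi>.
    void resetForNextVerse() { stack_ = std::stack<bool>(); }

private:
    std::stack<bool> stack_;
};
```

If a bold <hi> opens in one verse and closes in the next, the reset empties the stack, so close() emits </i> against an open <b>. Rendering the chapter as a whole avoids that mismatch.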
This spanning problem affects JSword's rendering of a collection of
arbitrary verses. A tag can be open in one verse, but because the verse
is not shown in context, it is never closed.
There is also an advantage of using XSLT over SAX: it is not limited to
a single pass of the document. For example, this is used in Bible
Desktop to show margin notes.
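To illustrate why more than one pass helps with margin notes, here is a small C++ sketch that walks an in-memory document twice. It is not how Bible Desktop's stylesheet is written; the Node type, the output layout and renderWithMargin are assumptions made up for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical document model: a flat run of text pieces and notes.
struct Node {
    bool isNote;
    std::string content;
};

std::string renderWithMargin(const std::vector<Node> &doc) {
    // Pass 1: collect every note so the margin can be emitted first.
    std::string margin;
    int count = 0;
    for (const Node &n : doc) {
        if (n.isNote)
            margin += "[" + std::to_string(++count) + "] " + n.content + "\n";
    }

    // Pass 2: emit the running text, replacing each note with its marker.
    std::string body;
    int marker = 0;
    for (const Node &n : doc) {
        if (n.isNote)
            body += "[" + std::to_string(++marker) + "]";
        else
            body += n.content;
    }

    return "MARGIN:\n" + margin + "TEXT:\n" + body + "\n";
}
```

A strictly single-pass handler would have to buffer the whole text before it could write the margin ahead of it; a stylesheet can simply select the notes once and the text once.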
Regarding TEI, JSword pretends it is OSIS. This is not a far stretch
since OSIS was influenced by TEI. The XSLT has a few entries to be able
to display key elements. Since TEI is rather open, and in flux, not all
of what we will use will be found in it. I haven't looked at it but
Chris has a TEI schema he uses for validation. That could be used to
improve the XSLT or for TEI modules to have their own XSLT.
Regarding ThML, JSword would do well to not convert it to OSIS but have
XSLT for it as well.
Regarding the speed of XSLT vs SAX vs SWORD's renderers: except for
handhelds (pda, phone, ...) it is a moot point. I figure that 5-6 years
is the maximum useful lifespan of a computer. The processing power of a
computer in these years, even a netbook, is sufficient to run XSLT fast
enough over a chapter's worth of verses to satisfy end users. I have an
old 486, Windows 98 laptop with limited memory that runs it acceptably.
Even my OLPC (one laptop per child) is fast enough.
Beyond JSword and how it could be used in SWORD without much change to
the current library:
I'm not sure, but I think any SWORD front-end can try out XSLT if they
like on OSIS documents using the osisosis.cpp filter. The filter does
not attempt to do too much except reconstruct verses. It might need to
be modified to output milestoned verse markers instead of the begin/end
tags it does now. Using begin/end tags makes the assumption that a verse
is a well-formed fragment. Just use it to "render" a chapter and then
pass that chapter to XSLT.
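The difference between the two verse encodings can be sketched in C++ as below. The function names are hypothetical and this is not the osisosis.cpp code; the milestone form follows the OSIS convention of an empty start marker carrying sID and an empty end marker carrying the matching eID.

```cpp
#include <string>

// Container form: assumes the verse body is itself a well-formed fragment.
std::string asContainer(const std::string &osisID, const std::string &body) {
    return "<verse osisID=\"" + osisID + "\">" + body + "</verse>";
}

// Milestone form: empty start/end markers that never enclose other elements.
std::string asMilestones(const std::string &osisID, const std::string &body) {
    return "<verse osisID=\"" + osisID + "\" sID=\"" + osisID + "\"/>" + body +
           "<verse eID=\"" + osisID + "\"/>";
}
```

If a <q> opens in one verse and closes in the next, asContainer produces overlapping elements (<verse><q>...</verse>...</q>), while asMilestones leaves the <q> properly nested once the verses are concatenated into a chapter.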
I'm hearing that lots of people won't seriously look at XSLT. It has a
steep but short learning curve. Kind of like Perl. There are two basic
programming models using XSLT: one that understands the containment
model of the schema. The other handles the tags as they appear, not
caring whether the document is structured correctly. They have their
pros and cons. (BD's XSLT uses the latter model.) But there are more and
more systems that are using xpath notation and people are becoming more
familiar with it. I think the audience of users that fairly easily buy
into XSLT are those that work with XML and DOM all the time. This
includes web developers.
As more and more of our front-ends are targeting browser engines for
display, it is or will become feasible for the transformation to be done
directly by the browser. Today, all current browsers (IE, Firefox,
Safari, Chrome, WebKit, Opera) can directly do the transformations. For
an example see:
I imagine it is possible, but I don't know how to pass parameters to the
stylesheet when done this way.
I don't know if it works with embedded browsers (xulrunner, webkit, ie),
but I'd guess it does.
There may be no need for SWORD to have html render filters. Just
transform the module into well-formed xml, feed it to a browser along
with a stylesheet.
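One way to hand a browser both the document and the stylesheet is the standard xml-stylesheet processing instruction. The C++ sketch below only builds that wrapper text; wrapForBrowser is a hypothetical helper, and it assumes the body already has a single root element and that the href needs no escaping.

```cpp
#include <string>

// Prepend an XML declaration and an xml-stylesheet processing instruction so
// that a browser applies the named XSLT when it loads the document.
std::string wrapForBrowser(const std::string &xmlBody,
                           const std::string &stylesheetHref) {
    return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
           "<?xml-stylesheet type=\"text/xsl\" href=\"" + stylesheetHref + "\"?>\n" +
           xmlBody + "\n";
}
```

For example, wrapForBrowser("<osis>...</osis>", "simple.xsl") yields a document whose second line points the browser at simple.xsl (a made-up file name).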
Some things are hard to do in XSLT. Some are not possible/feasible. Some
are way too slow. So there will always be a need for a pre-processor to
do some up front work. Or for the XSLT to call out to another program.
Hope this is helpful.
On 12/01/2010 07:20 AM, Troy A. Griffitts wrote:
> The logic to get from any Publisher Source Document to rendered HTML is
> a very complex task to solve.
> We conceptually create Plato's Form of, say, a Bible, and try to fit
> imperfect Publisher markup into this concept. A Bible has verses,
> headings between verses, chapter intros, footnotes, crossrefs, lemma
> information, etc.
> If we do not do this, then we become a PDF reader-- there are already
> PDF readers and we lose the ability to do Bible specific things with our
> software. For example, if we didn't normalize the concept of crossref
> across all Books, then we couldn't turn them on and off; we couldn't
> provide a crossref panel in the reader which fills according to which
> crossref is hovered over, etc. Same with notes, strongs, headings, etc.
> This causes us to impose our Form onto a publisher's text. I understand
> why some people may not like this, but it is very much to our end users'
> benefit that we do this. Without this, we become a web-browser or a PDF
> reader. Which are fine for their purpose, but we intend to provide
> common, familiar, and sometimes novel Bible study aids to our reader.
> The current processing model is dark magic and I apologize for this. It
> should be well documented and easy to modify. I will attempt to improve
> the dissemination of knowledge of exactly WHAT our Forms are, how we
> impose those Forms on publishers' texts and improve the documentation
> and code to help others understand and have the ability to improve the code.
> I'll attempt to post a few easy to swallow SWORD 101 classes in email,
> which will help us gather our thoughts and documents on how all this works.
> On 12/01/2010 12:09 AM, Greg Hellings wrote:
>> On Tue, Nov 30, 2010 at 1:08 PM, Troy A. Griffitts <scribe at crosswire.org> wrote:
>>> Having finally returned from a hectic 2 weeks of conferences, and lots
>>> to do before leaving for Christmas, I'm not sure I'm up for a heated,
>>> passionate debate about technologies right now, but by all means, please
>>> commence the public discussion.
>>> Let me start by saying that everyone (I believe) agrees that we would
>>> like to have an HTML output from the engine which is more generic and
>>> would allow CSS to be applied if a frontend would like to do this.
>>> Currently HTMLHREF output from the engine is used by the widest number
>>> of frontends (to my knowledge) and would benefit everyone involved by
>>> becoming much more generic. e.g.,
>>> <title> -> <h1>
>>> rather than
>>> <title> -> <b><br />
>>> <transChange type="added"> -> <span class="tcAdded">
>>> rather than
>>> <transChange type="added"> -> <i>
>>> I believe this will solve a number of issues and possibly get the BT and
>>> MacSword teams onboard to using the same HTML output filters as the
>>> other projects involve (or at least subclassing them and using the
>>> majority of their functionality).
>> I think this is our pretty well accepted premise. The current filters
>> stink to various degrees and currently no one is willing to step up
>> and tackle them.
>>> Now, as to the other issue of using XSLT internally in the engine to
>>> process OSIS -> HTML
>>> I will throw a few melons into the air for target practice, and let the
>>> shooting commence.
>>> *Multiple Language*
>>> XSLT is a programming language in the same sense that C++ is a
>>> programming language.
>>> The SWORD Project C++ engine is written in C++. It is not a Python
>>> engine; it is not a Perl engine; it is not a Java engine; it is C++.
>>> One might say, "Well, you can use XSLT from C++. Doesn't JSword do this
>>> from Java?" Well, yes, of course you can, and DM can comment, if he
>>> feels the desire to recommend his decision to incorporate an XSLT engine
>>> into the JSword logic flow. But simply because one CAN doesn't mean one
>>> SHOULD. We COULD incorporate a Perl text processing engine in our C++
>>> code, or an Awk processing engine... that doesn't mean we SHOULD. I'm
>>> sure some would say we SHOULD. And obviously DM has thought he SHOULD
>>> incorporate XSLT processing for JSword, so I'm not intending to say it
>>> is a BAD decision, just that it is not a decision I would make; in the
>>> same way as our projects each chose C++ vs. Java to implement our objective.
>> If a developer is going to develop OSIS -> HTML filters, for instance,
>> we are already assuming they know OSIS and HTML. OSIS is XML and HTML
>> is SGML (though most of our work is probably targeting a more
>> XML-dialect of HTML). XSLT is also XML. Formally, it is not even a
>> programming language, but just a set of formatting/processing
>> instructions in XML.
>> Any developer using XML who is worth their salt should at least be
>> familiar with the basics of XSL - they may not be a guru of XPath
>> expressions or have every attribute of XSL memorized - and would
>> probably expect a library which handles XML as its preferred input
>> method to utilize one of the standard XML processing methods. I know
>> I'm not the only person who was surprised to look in the library
>> filters and see neither DOM, SAX nor XSLT technologies in use. That
>> was when I first ran and hid.
>> Of course, this portion of the discussion is only relevant for the
>> from-OSIS filters.
>>> *XSLT better than C++*
>>> One might say, "well, XSLT is better suited to process XML than C++."
>>> That's a loaded and unquantified statement.
>>> Certainly the C++ language specification doesn't include facilities to
>>> easily process XML, but that doesn't mean a plethora of C++ libraries
>>> don't exist for assisting in this task.
>>> The SWORD engine includes classes like XMLTag and SWBasicFilter which
>>> implement a SAX processing model.
>>> The current filters do not all use SWBasicFilter, nor XMLTag. They've
>>> been written over 15 years and many before these classes existed. Some
>>> are ugly and need to be rewritten for readability, certainly. But not
>>> necessarily in a different programming language.
>> XSLT being "better" is, yes, a matter of complete subjectivity. And,
>> as I mentioned above, is only useful when our source is XML to begin
>> with. For GBF or Plaintext sources, XSLT is clearly not even
>> But the current C++ is so good that you seem the only person willing
>> to touch it. Peter just mentioned he tried once and couldn't get it.
>> I have gone into the filters before with a singular goal in mind and
>> was able to produce my desired changes, but it was long, drawn-out and
>> painful. Doing the same tasks in XSL would have taken me mere
>> seconds. I know a few other people, at least, have said they would
>> know how to do a task if XSLT was used instead of C++. Of course,
>> that is a hypothetical - I can't know that they would have done so,
>> but that was their claim at the time.
>> Our recent discussion about the use of the "n" attribute for footnotes
>> in ThML is a perfect example. Maintaining the attribute in XSL would
>> have been a trivial task I could have handled in seconds. Instead, it
>> required you, myself and Karl and took about 10 days to get fixed.
>> You had to alert Karl and me to presence of the attributes, I provided
>> him a preliminary patch to incorporate the values, then he had to
>> heavily modify the patch to operate correctly in non-ThML source and a
>> few other corner cases. And, in the end, the fix is only in Xiphos'
>> code base - I would have to go through 2 of those three steps again in
>> Bibletime, BPBible, MacSword and any other applications I wanted to
>> see proper behavior in. Alternatively I could tackle the filters -
>> but I'm not really inclined to do so.
>> Is XSLT "better"? For me, it would be better because I could more
>> easily modify its behavior based on the fact that I know XML and could
>> easily locate the necessary processing directive. For you, maybe not.
>> Are there things you simply cannot do in XSL that C++ can? Yes. IMO
>> the benefits of XSL outweigh the benefits of C++ for this task, but
>> you clearly disagree. :) I would also say that DOM or SAX processing
>> would be better for all the same reasons - it shields the user from
>> having to see the XML parsing and handle inconsistencies in
>> whitespace, validation, etc and is still a decently well-known
>> technology among XML users (even if it's slightly less well-known than
>> XSL). And with a DOM or SAX parser, you could still happily employ
>> the full power of C++.
>>> The task of enumerating all types of OSIS <title> tags, and deciding
>>> what to do with each, and how to classify all <title> tags from all
>>> possible OSIS documents into our enumeration is still going to be a
>>> complex task using XSLT. <title> is a complex example, but certainly
>>> not the most complex.
>>> It is a tall task to generalize all elements of all documents from all
>>> publishers into one conceptual model with one chosen output for a
>>> frontend-- whether that be for an audience on the Desktop, web-based, or
>>> a handheld.
>>> The complex processing required by the engine will require long, complex
>>> XSLT-- which likely will incorporate callbacks to C++. It will not be
>>> more simple-- only mixed language.
>> I could also argue that the XSL would not require a developer to
>> mentally filter out the code that just identifies and locates XML
>> elements and attributes and parses them from the code that transforms
>> them and generates the output. Thus yes, it might include some
>> extension functions into C++ but it would be simpler. And it would
>> also be more expressive.
>> The enumeration of every OSIS <title> tag is a moot point for the
>> decision. You need to enumerate them all in C++ as well and decide
>> what to do with them. That doesn't change in the XSL - just the
>> method used. An XSL match along the lines of <xsl:template
>> match="title[@type='psalm']"> still has to be done in C++ with some sort
>> of if(tag.name() == "title" && tag.attr("type") == "psalm") or whatever
>> the syntax is. And that is assuming the current filter is using
>> XMLTag and isn't comparing character strings directly.
>>> *Semantic vs. Display*
>>> Some will say (and have), "well, let everything be display oriented and
>>> let the publisher decide". Fine, then you lose 2 things: the ability to
>>> display differently per user preference, per display device; and you
>>> also give up the promise to actually do any interesting research on the
>>> text. When you lose semantic markup, then you lose all interesting
>>> information about WHAT is being marked up.
>> I just want to be clear that I'm not advocating the use of display
>> over semantics as a general choice. My statements are strictly based
>> around my specific task and the fact that OSIS support in SWORD and
>> the front ends is not as good as the support of ThML. Largely this is
>> because most applications display in HTML and my required task is
>> framed entirely in terms of the presentation and display - not the
>> semantics. I would love and prefer to use OSIS for this task, but I
>> simply cannot accomplish it with the state of SWORD at this time.
>>> *More than a Rendering Engine*
>>> The SWORD C++ Engine is more than simply a text rendering engine-- it is
>>> a Biblical text research engine.
>>> If I'd like to know the morphology of word 3 in 2Thes 2.13 of the WHNU
>>> Greek text, the entire program to do such is:
>>> SWMgr library;
>>> SWModule *whnu = library.getModule("WHNU");
>>> cout << "The morphology of word three is: " <<
>>> whnu->getEntryAttributes()["Word"]["003"]["Morph"] << endl;
>>> That reads nice (at least in my opinion). I don't need to know about
>>> XML, XSLT, care what markup the WHNU module uses, I don't even have to
>>> know how to make a SWORD filter. The current filters do all the work of
>>> breaking out these attributes and making them available in a nice and
>>> interesting map.
>> I'd like to be clear again, that XSL would only be useful for material
>> already in OSIS formats (or in valid ThML - I think TEI is also an XML
>> format?). I doubt many modules in ThML are strictly valid at their
>> import times, so XSL wouldn't be very useful, and GBF is a monster
>> unto itself. Doing the above in XSL from an OSIS source would not be
>> much different in complexity than what you have listed there.
>> <xsl:template match="verse[@osisID='2thes.2.13']/w[@n=3]">
>> The morphology of word three is: <xsl:value-of select="@morph" />
>> </xsl:template>
>> Or something similar (my knowledge of exact OSIS attribute names and
>> values wanes and it's been two or three weeks since I wrote an XPath
>> Of course, the string processing portion of SWORD would continue to be
>> of great importance for any modules in GBF format or similar to bring
>> them into a useful form. In that way, SWORD would continue to be more
>> than just a text rendering engine. It would continue to offer all of
>> its features, its buffering from the system and from the format, its
>> indexing, its module fetching and storing, etc.
>>> And finally, if bullets aren't flying already, I'll stir the heat up with...
>>> XSLT sucks. A good C++ programmer can do anything in C++ better than
>>> any XSLT programmer.
>> A C++ programmer can definitely do more, since C++ is actually a
>> programming language and XSLT is a set of processing instructions.
>> Better? That depends on what the criteria are. For me, in my current
>> role as a module creator, the use of C++ is not currently better
>> because it is less flexible and extensible. For you, as the library
>> maintainer, perhaps C++ is better because it's what you are already
>> comfortable with and because it has largely been your hand in the
>>> Have fun.
>>> PS. In summary, I understand the current filters are sometimes overly
>>> complex and need cleanup, standardization, etc. It comes down to the
>>> fact that they mostly work, and other things which don't get priority,
>>> so they don't get much attention. But honestly, I think one might be
>>> oversimplifying the problem at hand without realizing it, if one simply
>>> thinks switching to XSLT will make things easier.
>> I think one is also oversimplifying the options. My dreamlist is that
>> SWORD produce a well-formed, valid, complete OSIS document for an
>> arbitrary KeyList that I pass it with FMT_OSIS set. That basically
>> boils down to getting the *OSIS filters up to snuff and standardized.
>> The second item on the list is a readily extensible mechanism for
>> SWORD outputting HTML from that OSIS. If that choice is providing an
>> XSL stylesheet with the library, a C++ SAX processor that a front-end
>> can readily extend, a DOM interface that can be easily customized is
>> immaterial to me. I like all three of those, and can easily
>> understand and extend all of them.
>> I think any of those technologies would be an improvement over all
>> in-house C++ for the second half of any such processing. If we are
>> using XML in Open Source Software, let's leverage the work of others
>> who have happily given us permission to use their libraries!
>> sword-devel mailing list: sword-devel at crosswire.org
>> Instructions to unsubscribe/change your settings at above page