[sword-devel] XSLT vs. C++
dmsmith at crosswire.org
Wed Dec 1 07:34:29 MST 2010
I like Plato's Chair analogy. But not the conclusions drawn from it.
I think we all agree that some level of structural markup is necessary to identify: books, chapters, verses, titles, intros, words of Christ, footnotes, cross-references, and anything else we might want to treat specially beyond just presentation.
I like deep structural markup that goes beyond what we currently use, e.g. markup of names and place names, so that we are not limited by what we have done, but what we can envision later.
Some structural markup, such as poetry markup, is today used merely presentationally. As a result, it often is not structurally meaningful. This is a problem of the module maker creating something that looks nice but has no value for software processing (e.g. getNextPoetryBlock() just won't get the desired results).
The problem with the Plato's Chair analogy is that SWORD is not merely an idea, but *an* implementation of that chair. I'd say it looks rather like a 1980's dinette chair constructed of steel tubing and vinyl cushions.
The biggest problem I see with the modules and the filters is that they are lossy and/or incomplete. I'll keep my remarks to the OSIS process as that is what I am most familiar with, and since it is *a* chair, it is not too far removed from ThML's chair.
Regarding the modules, of necessity, we transform BSP OSIS (aka Book, Section, Paragraph with verse markers) into BCV (Book, Chapter, Verse) without verse markers. (ThML, GBF, PlainText readily lend themselves to BCV directly. I'm going to guess that is *a* major motivation for ThML.)
The purpose of osis2mod is to transform the publishers' chairs into SWORD's chairs. The shortcoming of using IMP or VPL to import OSIS (or any other module type) is that it bypasses such a transformation and puts the burden on the module maker to construct SWORD's chair directly.
Regarding the filters, there is agreement that they need help. The problem with the OSIS to HTML filters is that they are not written to display what is defined by the OSIS spec, but only what the filter author thought was important. Some examples: OSIS allows for a title to be within a title, that is, to have sub-titles. OSIS allows rich markup within titles, such as footnotes, cross-references, divine name, etc. OSIS allows for significant content between verses. Words of Christ in verses can be punctuated by other words. These were or are problematic for these filters.
The second problem with these filters is that they are lossy. The filters only look for a subset of the OSIS tags and attributes. Examples: the "n" attribute on footnotes. Of the various types of <hi>, bold is handled well, but everything else gets italic (line-through, acrostic, illuminated, small-caps, sub, super). Table, row and cell are ignored (these could easily be in genbooks). And lots more....
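To make the <hi> point concrete, here is a minimal sketch of a lossless mapping, where each type gets its own CSS class instead of collapsing to italic. The function name hiOpenTag and the "hi-" class names are assumptions made up for illustration; they are not part of SWORD or of any existing filter, and the list of types is just the ones named above plus italic.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical helper, NOT part of SWORD: turn the value of an OSIS
// <hi type="..."> attribute into an opening HTML <span> whose class
// preserves the type, so a frontend can style each one with CSS.
std::string hiOpenTag(const std::string &type) {
    // Types named in the message above, plus italic. Only whitelisted
    // values are echoed into the output, so no escaping is needed here.
    static const std::unordered_set<std::string> known = {
        "bold", "italic", "line-through", "acrostic",
        "illuminated", "small-caps", "sub", "super"
    };
    if (known.count(type)) {
        return "<span class=\"hi-" + type + "\">";
    }
    // Anything unrecognized gets a neutral class rather than being
    // silently rendered as italic.
    return "<span class=\"hi-unknown\">";
}
```

With something like this, hiOpenTag("small-caps") would yield a span with class "hi-small-caps", and the choice of how small caps actually look is left to the frontend's stylesheet.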
This is a community effort and we all have different skill sets. I'm particularly weak in doing C++ coding as I have been away from it for too long (I started with C++ 1.0 and moved to something else just before 3.0 was released). Otherwise, I'd have tackled the lossy-ness of the filters.
As I look at the code, the essential part of the SWORD chair seems to be how it pulls out of line various components into easily addressed structures: titles, footnotes, ..... I've tried but I don't understand this at all.
Within the osishtmlhref filter there are various notions that are necessary to understand but are entirely baffling to me: suspendTextPassThru, suspendLevel, lastSuspendSegment, supressAdjacentWhitespace, <!P>, <!/P>.
So, if one were to write a new OSIS filter from scratch, I'd like to know what has to be done to meet/match SWORD's ideal chair.
On Dec 1, 2010, at 7:20 AM, Troy A. Griffitts wrote:
> The logic to get from any Publisher Source Document to rendered HTML is
> a very complex task to solve.
> We conceptually create Plato's Form of, say, a Bible, and try to fit
> imperfect Publisher markup into this concept. A Bible has verses,
> headings between verses, chapter intros, footnotes, crossrefs, lemma
> information, etc.
> If we do not do this, then we become a PDF reader-- there are already
> PDF readers and we lose the ability to do Bible specific things with our
> software. For example, if we didn't normalize the concept of crossref
> across all Books, then we couldn't turn them on and off; we couldn't
> provide a crossref panel in the reader which fills according to which
> crossref is hovered over, etc. Same with notes, strongs, headings, etc.
> This causes us to impose our Form onto a publisher's text. I understand
> why some people may not like this, but it is very much to our end users'
> benefit that we do this. Without this, we become a web-browser or a PDF
> reader. Which are fine for their purpose, but we intend to provide
> common, familiar, and sometimes novel Bible study aids to our reader.
> The current processing model is dark magic and I apologize for this. It
> should be well documented and easy to modify. I will attempt to improve
> the dissemination of knowledge of exactly WHAT our Forms are, how we
> impose those Forms on publishers' texts and improve the documentation
> and code to help others understand and have the ability to improve the code.
> I'll attempt to post a few easy to swallow SWORD 101 classes in email,
> which will help us gather our thoughts and documents on how all this works.
> On 12/01/2010 12:09 AM, Greg Hellings wrote:
>> On Tue, Nov 30, 2010 at 1:08 PM, Troy A. Griffitts <scribe at crosswire.org> wrote:
>>> Having finally returned from a hectic 2 weeks of conferences, and lots
>>> to do before leaving for Christmas, I'm not sure I'm up for a heated,
>>> passionate debate about technologies right now, but by all means, please
>>> commence the public discussion.
>>> Let me start by saying that everyone (I believe) agrees that we would
>>> like to have an HTML output from the engine which is more generic and
>>> would allow CSS to be applied if a frontend would like to do this.
>>> Currently HTMLHREF output from the engine is used by the widest number
>>> of frontends (to my knowledge) and would benefit everyone involved by
>>> becoming much more generic. e.g.,
>>> <title> -> <h1>
>>> rather than
>>> <title> -> <b><br />
>>> <transChange type="added"> -> <span class="tcAdded">
>>> rather than
>>> <transChange type="added"> -> <i>
>>> I believe this will solve a number of issues and possibly get the BT and
>>> MacSword teams onboard to using the same HTML output filters as the
>>> other projects involve (or at least subclassing them and using the
>>> majority of their functionality).
>> I think this is our pretty well accepted premise. The current filters
>> stink to various degrees and currently no one is willing to step up
>> and tackle them.
>>> Now, as to the other issue of using XSLT internally in the engine to
>>> process OSIS -> HTML
>>> I will throw a few melons into the air for target practice, and let the
>>> shooting commence.
>>> *Multiple Language*
>>> XSLT is a programming language in the same sense that C++ is a
>>> programming language.
>>> The SWORD Project C++ engine is written in C++. It is not a Python
>>> engine; it is not a Perl engine; it is not a Java engine; it is C++.
>>> One might say, "Well, you can use XSLT from C++. Doesn't JSword do this
>>> from Java?" Well, yes, of course you can, and DM can comment, if he
>>> feels the desire to recommend his decision to incorporate an XSLT engine
>>> into the JSword logic flow. But simply because one CAN doesn't mean one
>>> SHOULD. We COULD incorporate a Perl text processing engine in our C++
>>> code, or an Awk processing engine... that doesn't mean we SHOULD. I'm
>>> sure some would say we SHOULD. And obviously DM has thought he SHOULD
>>> incorporate XSLT processing for JSword, so I'm not intending to say it
>>> is a BAD decision, just that it is not a decision I would make; in the
>>> same way as our projects each chose C++ vs. Java to implement our objective.
>> If a developer is going to develop OSIS -> HTML filters, for instance,
>> we are already assuming they know OSIS and HTML. OSIS is XML and HTML
>> is SGML (though most of our work is probably targeting a more
>> XML-dialect of HTML). XSLT is also XML. Formally, it is not even a
>> programming language, but just a set of formatting/processing
>> instructions in XML.
>> Any developer using XML who is worth their salt should at least be
>> familiar with the basics of XSL - they may not be a guru of XPath
>> expressions or have every attribute of XSL memorized - and would
>> probably expect a library which handles XML as its preferred input
>> method to utilize one of the standard XML processing methods. I know
>> I'm not the only person who was surprised to look in the library
>> filters and see neither DOM, SAX nor XSLT technologies in use. That
>> was when I first ran and hid.
>> Of course, this portion of the discussion is only relevant for the
>> from-OSIS filters.
>>> *XSLT better than C++*
>>> One might say, "well, XSLT is better suited to process XML than C++."
>>> That's a loaded and unquantified statement.
>>> Certainly the C++ language specification doesn't include facilities to
>>> easily process XML, but that doesn't mean a plethora of C++ libraries
>>> don't exist for assisting in this task.
>>> The SWORD engine includes classes like XMLTag and SWBasicFilter which
>>> implement a SAX processing model.
>>> The current filters do not all use SWBasicFilter, nor XMLTag. They've
>>> been written over 15 years and many before these classes existed. Some
>>> are ugly and need to be rewritten for readability, certainly. But not
>>> necessarily in a different programming language.
>> XSLT being "better" is, yes, a matter of complete subjectivity. And,
>> as I mentioned above, is only useful when our source is XML to begin
>> with. For GBF or Plaintext sources, XSLT is clearly not even
>> But the current C++ is so good that you seem the only person willing
>> to touch it. Peter just mentioned he tried once and couldn't get it.
>> I have gone into the filters before with a singular goal in mind and
>> was able to produce my desired changes, but it was long, drawn-out and
>> painful. Doing the same tasks in XSL would have taken me mere
>> seconds. I know a few other people, at least, have said they would
>> know how to do a task if XSLT was used instead of C++. Of course,
>> that is a hypothetical - I can't know that they would have done so,
>> but that was their claim at the time.
>> Our recent discussion about the use of the "n" attribute for footnotes
>> in ThML is a perfect example. Maintaining the attribute in XSL would
>> have been a trivial task I could have handled in seconds. Instead, it
>> required you, myself and Karl and took about 10 days to get fixed.
>> You had to alert Karl and me to the presence of the attributes, I provided
>> him a preliminary patch to incorporate the values, then he had to
>> heavily modify the patch to operate correctly in non-ThML source and a
>> few other corner cases. And, in the end, the fix is only in Xiphos'
>> code base - I would have to go through 2 of those three steps again in
>> Bibletime, BPBible, MacSword and any other applications I wanted to
>> see proper behavior in. Alternatively I could tackle the filters -
>> but I'm not really inclined to do so.
>> Is XSLT "better"? For me, it would be better because I could more
>> easily modify its behavior based on the fact that I know XML and could
>> easily locate the necessary processing directive. For you, maybe not.
>> Are there things you simply cannot do in XSL that C++ can? Yes. IMO
>> the benefits of XSL outweigh the benefits of C++ for this task, but
>> you clearly disagree. :) I would also say that DOM or SAX processing
>> would be better for all the same reasons - it shields the user from
>> having to see the XML parsing and handle inconsistencies in
>> whitespace, validation, etc and is still a decently well-known
>> technology among XML users (even if it's slightly less well-known than
>> XSL). And with a DOM or SAX parser, you could still happily employ
>> the full power of C++.
>>> The task of enumerating all types of OSIS <title> tags, and deciding
>>> what to do with each, and how to classify all <title> tags from all
>>> possible OSIS documents into our enumeration is still going to be a
>>> complex task using XSLT. <title> is a complex example, but certainly
>>> not the most complex.
>>> It is a tall task to generalize all elements of all documents from all
>>> publishers into one conceptual model with one chosen output for a
>>> frontend-- whether that be for an audience on the Desktop, web-based, or
>>> a handheld.
>>> The complex processing required by the engine will require long, complex
>>> XSLT-- which likely will incorporate callbacks to C++. It will not be
>>> more simple-- only mixed language.
>> I could also argue that the XSL would not require a developer to
>> mentally filter out the code that just identifies and locates XML
>> elements and attributes and parses them from the code that transforms
>> them and generates the output. Thus yes, it might include some
>> extension functions into C++ but it would be simpler. And it would
>> also be more expressive.
>> The enumeration of every OSIS <title> tag is a moot point for the
>> decision. You need to enumerate them all in C++ as well and decide
>> what to do with them. That doesn't change in the XSL - just the
>> method used. An XSL match along the lines of <xsl:template
>> match="title[@type='psalm']"> still has to be done in C++ with some sort
>> of if(tag.name() == "title" && tag.attr("type") == "psalm") or whatever
>> the syntax is. And that is assuming the current filter is using
>> XMLTag and isn't comparing character strings directly.
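As a rough illustration of the comparison above, the following sketch shows what the C++ side of a title[@type='psalm'] match might look like. The Tag struct and the isPsalmTitle function are hypothetical stand-ins invented for this example; they are not SWORD's XMLTag class and do not reflect its actual method names.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for an already-parsed start tag.
// This is NOT SWORD's XMLTag; it only models a name plus attributes.
struct Tag {
    std::string name;
    std::map<std::string, std::string> attrs;

    // Return the attribute's value, or an empty string when it is absent.
    std::string attr(const std::string &key) const {
        auto it = attrs.find(key);
        return it == attrs.end() ? std::string() : it->second;
    }
};

// Rough C++ counterpart of the XPath test title[@type='psalm'].
bool isPsalmTitle(const Tag &tag) {
    return tag.name == "title" && tag.attr("type") == "psalm";
}
```

Either way the same decision gets made; the difference under discussion is only whether it is written as a declarative match pattern or as a conditional inside the filter's tag handler.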
>>> *Semantic vs. Display*
>>> Some will say (and have), "well, let everything be display oriented and
>>> let the publisher decide". Fine, then you lose 2 things: the ability to
>>> display differently per user preference, per display device; and you
>>> also give up the promise to actually do any interesting research on the
>>> text. When you lose semantic markup, then you lose all interesting
>>> information about WHAT is being marked up.
>> I just want to be clear that I'm not advocating the use of display
>> over semantics as a general choice. My statements are strictly based
>> around my specific task and the fact that OSIS support in SWORD and
>> the front ends is not as good as the support of ThML. Largely this is
>> because most applications display in HTML and my required task is
>> framed entirely in terms of the presentation and display - not the
>> semantics. I would love and prefer to use OSIS for this task, but I
>> simply cannot accomplish it with the state of SWORD at this time.
>>> *More than a Rendering Engine*
>>> The SWORD C++ Engine is more than simply a text rendering engine-- it is
>>> a Biblical text research engine.
>>> If I'd like to know the morphology of word 3 in 2Thes 2.13 of the WHNU
>>> Greek text, the entire program to do such is:
>>> SWMgr library;
>>> SWModule *whnu = library.getModule("WHNU");
>>> cout << "The morphology of word three is: " <<
>>> whnu->getEntryAttributes()["Word"]["003"]["Morph"] << endl;
>>> That reads nice (at least in my opinion). I don't need to know about
>>> XML, XSLT, care what markup the WHNU module uses, I don't even have to
>>> know how to make a SWORD filter. The current filters do all the work of
>>> breaking out these attributes and making them available in a nice and
>>> interesting map.
>> I'd like to be clear again, that XSL would only be useful for material
>> already in OSIS formats (or in valid ThML - I think TEI is also an XML
>> format?). I doubt many modules in ThML are strictly valid at their
>> import times, so XSL wouldn't be very useful, and GBF is a monster
>> unto itself. Doing the above in XSL from an OSIS source would not be
>> much different in complexity than what you have listed there.
>> <xsl:template match="verse[@osisID='2thes.2.13']/w[@n=3]">
>> The morphology of word three is: <xsl:value-of select="@morph" />
>> </xsl:template>
>> Or something similar (my knowledge of exact OSIS attribute names and
>> values wanes and it's been two or three weeks since I wrote an XPath
>> Of course, the string processing portion of SWORD would continue to be
>> of great importance for any modules in GBF format or similar to bring
>> them into a useful form. In that way, SWORD would continue to be more
>> than just a text rendering engine. It would continue to offer all of
>> its features, its buffering from the system and from the format, its
>> indexing, its module fetching and storing, etc.
>>> And finally, if bullets aren't flying already, I'll stir the heat up with...
>>> XSLT sucks. A good C++ programmer can do anything in C++ better than
>>> any XSLT programmer.
>> A C++ programmer can definitely do more, since C++ is actually a
>> programming language and XSLT is a set of processing instructions.
>> Better? That depends on what the criteria are. For me, in my current
>> role as a module creator, the use of C++ is not currently better
>> because it is less flexible and extensible. For you, as the library
>> maintainer, perhaps C++ is better because it's what you are already
>> comfortable with and because it has largely been your hand in the
>>> Have fun.
>>> PS. In summary, I understand the current filters are sometimes overly
>>> complex and need cleanup, standardization, etc. It comes down to the
>>> fact that they mostly work, and other things which don't get priority,
>>> so they don't get much attention. But honestly, I think one might be
>>> oversimplifying the problem at hand without realizing it, if one simply
>>> thinks switching to XSLT will make things easier.
>> I think one is also oversimplifying the options. My dreamlist is that
>> SWORD produce a well-formed, valid, complete OSIS document for an
>> arbitrary KeyList that I pass it with FMT_OSIS set. That basically
>> boils down to getting the *OSIS filters up to snuff and standardized.
>> The second item on the list is a readily extensible mechanism for
>> SWORD outputting HTML from that OSIS. Whether that choice is providing an
>> XSL stylesheet with the library, a C++ SAX processor that a front-end
>> can readily extend, or a DOM interface that can be easily customized is
>> immaterial to me. I like all three of those, and can easily
>> understand and extend all of them.
>> I think any of those technologies would be an improvement over all
>> in-house C++ for the second half of any such processing. If we are
>> using XML in Open Source Software, let's leverage the work of others
>> who have happily given us permission to use their libraries!
>> sword-devel mailing list: sword-devel at crosswire.org
>> Instructions to unsubscribe/change your settings at above page