[sword-devel] Insidious mismatched tag errors: recommendations

Andrew Thule thulester at gmail.com
Sun Sep 23 08:35:36 MST 2012


This is the clearest description of the milestone vs containers, BCV vs BSP
issue I've seen so far.  Thanks for the summary.  It's very helpful.

~A

On Friday, September 21, 2012, DM Smith wrote:

> So far the discussion is around whether the xml is well-formed.
> Once you get that working, then you need to make sure it is valid wrt the
> OSIS schema.
>
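A minimal sketch of that first step, checking well-formedness only and not validity against the OSIS schema. It is written in Python rather than the Perl used later in this thread, and the function name first_wellformedness_error is hypothetical; it only wraps the standard library parser and reports where the first error was found.

```python
import xml.etree.ElementTree as ET


def first_wellformedness_error(xml_text):
    """Return (line, column, message) for the first parse error, or None.

    This checks well-formedness only; it knows nothing about the OSIS schema.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # position reported by the parser
        return line, column, str(err)
    return None


if __name__ == "__main__":
    sample = "<osis>\n<div><p>text</div></p>\n</osis>"
    print(first_wellformedness_error(sample))
```

The parser stops at the first error, so a badly damaged file has to be fixed and re-checked one problem at a time.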
> There's an old tool that will convert sgml to well-formed xml. I think it
> was James Clark's "sx". I've used it successfully on initial conversions
> and getting something that will work within xml tools.
>
> Finally, OSIS has the notion of milestones for start and end elements.
> There are semantic rules regarding this that cannot be checked by standard
> xml tools. Osis2mod tries to handle this. When you get to that point, I can
> help unravel the logging options.
>
> The purpose of milestoned elements is to allow for two competing document
> models to be in the same xml document: BSP and BCV (names we've given it
> here and in the wiki).
>
> We recommend using BSP (book, chapter, section, paragraph, poetry and list
> elements all as containers, not milestoned), with verse elements milestoned.
>
> Note, the OSIS manual says that if you have one element milestoned, then
> all other elements with the same tag name have to be milestoned.
> Practically speaking, this does not matter. SWORD and JSword don't care.
> Having verses milestoned only if necessary is probably a better way to
> create a good XML document. Start out with all of them as containers and
> each place where that causes a problem, either fix the xml or if otherwise
> correct, convert to milestoned verses.
>
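To illustrate why a verse sometimes has to be milestoned, here is a small sketch in Python (not the Perl used elsewhere in this thread). The verse text and osisID values are made up for illustration; sID and eID are the OSIS milestone attributes. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

# A container verse that starts in one paragraph and ends in the next.
# The <verse> and <p> elements overlap, so this is not well-formed XML.
CONTAINER_VERSE = (
    '<div><p><verse osisID="Gen.1.1">In the beginning</p>'
    '<p>God created</verse></p></div>'
)

# The same content with the verse milestoned: two empty elements mark the
# start (sID) and end (eID), so the paragraphs can remain containers.
MILESTONED_VERSE = (
    '<div><p><verse sID="Gen.1.1" osisID="Gen.1.1"/>In the beginning</p>'
    '<p>God created<verse eID="Gen.1.1"/></p></div>'
)


def is_well_formed(xml_text):
    """Return True if the standard library parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed(CONTAINER_VERSE))
    print(is_well_formed(MILESTONED_VERSE))
```

An XML parser cannot tell whether each sID has a matching eID; that is one of the semantic rules left to tools such as osis2mod.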
> Generally speaking these BSP elements should not start just inside or at
> the end of a verse. Rather they should be between verse elements or within
> the text. When they are placed just after the verse start, they often will
> cause the verse number to be orphaned. When they are placed just before the
> verse end, then it is generally not noticeable (just bad form).
>
> Quotes will cause the most grief here. They often cross element
> boundaries. Certainly, the Beatitudes do, starting in one chapter and
> ending a couple of chapters later. For this reason, using the milestoned
> version of the quote element is necessary.
>
> If your document follows some simple rules (some required by xml, others
> simplifications), then checking nesting is a simple matter of having a
> push/pop stack of elements. The simple rules:
> 1) All attributes when present have quoted values.
> 2) All entities are properly formed and used when needed. Also, < and >
> are not in attribute values.
> 3) Tags are marked with < ... >, </ ... >, or < ... />, and there are no
> new lines between < and >.
>
> If this is true then a simple perl script can be written to find the
> problems in the file:
> Look for < ... /> and skip them. They cause no problems.
> Look for < xxx ... > and push the tag name along with its location in the
> file on to the stack.
> Look for </ xxx > and compare xxx to the top element on the stack. If it
> matches, pop the stack. If it doesn't match, report an error.
> When you get to the end of the document and the stack is not empty, then
> the elements on the stack are not closed properly.
>
> Printing out the stack (elements and locations) would help find what the
> problem is.
>
> For example:
> if xxx is deeper in the stack, then there is a problem with nesting. Look
> at all the elements above the xxx on the stack for problems.
> if it is not in the stack, then the element was not started prior to that
> point or it may have been ended twice.
>
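The stack check described above can be sketched roughly as follows. This is Python rather than the Perl suggested in the thread, and it assumes the simple rules listed earlier hold (no < or > in attribute values, no new lines inside a tag). It also assumes that comments do not contain tags. The names TAG_RE and check_nesting are hypothetical.

```python
import re

# Start tags, end tags and empty tags that sit on a single line.
# Names starting with "!" or "?" (comments, doctype, processing
# instructions) are deliberately not matched.
TAG_RE = re.compile(r"<(/?)([^\s/>!?][^\s/>]*)[^>]*?(/?)>")


def check_nesting(lines):
    """Return a list of problem reports for an iterable of text lines."""
    stack = []      # entries are (tag_name, line_number)
    problems = []
    for line_no, line in enumerate(lines, start=1):
        for match in TAG_RE.finditer(line):
            closing, name, empty = match.groups()
            if empty:
                continue  # <xxx ... /> causes no nesting problems
            if not closing:
                stack.append((name, line_no))  # <xxx ...>: push
                continue
            open_names = [n for n, _ in stack]
            if open_names and open_names[-1] == name:
                stack.pop()  # </xxx> matches the top of the stack
            elif name in open_names:
                # xxx is deeper in the stack: everything above it is suspect.
                depth = len(open_names) - 1 - open_names[::-1].index(name)
                above = ", ".join(f"<{n}> from line {ln}"
                                  for n, ln in stack[depth + 1:])
                problems.append(
                    f"line {line_no}: </{name}> closes over {above}")
                del stack[depth:]  # recover by discarding down to xxx
            else:
                problems.append(
                    f"line {line_no}: </{name}> was never started "
                    "or was ended twice")
    for name, line_no in stack:
        problems.append(f"line {line_no}: <{name}> is never closed")
    return problems


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], encoding="utf-8") as handle:
        for problem in check_nesting(handle):
            print(problem)
```

Discarding the stack down to the matching element after a nesting error is just one possible recovery choice; it keeps a single mistake from producing a cascade of reports for the rest of the file.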
> Here is a simple perl script (that I wrote), which doesn't do that, but
> could be adapted to do it. This creates a histogram/dictionary of tag and
> attribute names.
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my %tags  = ();
> my %attrs = ();
> while (<>)
> {
>     # While there is a start (or empty) tag on the line
>     while (/<[^\/\s>]+[\/\s>]/)
>     {
>         # While there is an attribute in a tag
>         while (/<[^\/\s>]+\s+[^\=\/\>]+=\"[^\"]+\"/)
>         {
>             # Remove the attribute and count it as tag.attribute
>             s/<([^\/\s>]+)\s+([^\=\/\>]+)(\="[^\"]+\")(.*)/<$1 $4/;
>             my ($t, $name) = ($1, $2);
>             $attrs{"$t.$name"}++;
>         }
>         # Remove the tag and count it
>         s/<([^\/\s>]+)[\/\s>]//;
>         $tags{$1}++;
>     }
> }
>
> # Print each distinct tag name (the counts are collected but not printed)
> foreach my $tag (sort keys %tags)
> {
>     print("$tag\n");
> }
>
> # Print each distinct tag.attribute pair
> foreach my $attr (sort keys %attrs)
> {
>     print("$attr\n");
> }
>
> Hope this helps,
> DM
>
> On Sep 21, 2012, at 10:52 AM, Andrew Thule <thulester at gmail.com>
> wrote:
>
> Thanks everyone for suggestions.  I'll give them all a try.
>
> That said, the emacs recommendation is nearly a religious conversion
> recommendation.  (I'm on the vi side of the vi versus emacs debate.  I
> suppose as long as it doesn't kill me I should give it a try, though I'm
> not certain what impact it will have on the health of my soul ... :D )
>
> ~A
>
>
> On Thursday, September 20, 2012, Daniel Owens wrote:
>
>> I use jEdit with the XML plugin installed. I find it helps me find
>> problems fairly easily.
>>
>> Daniel
>>
>> On 09/20/2012 05:26 PM, Greg Hellings wrote:
>>
>>> There are a number of pieces of software out there that will
>>> pretty-print the XML for you, with indenting and whatnot. Overly
>>> indented for what you would want in production but decent for
>>> debugging mismatching nesting and the like.
>>>
>>> For example, 'xmllint --format' will properly indent the file, etc. I
>>> don't know how it will handle poorly formed XML.
>>>
>>> GUI editors can do wonders as well. On Windows I use Notepad++ and
>>> manually set it to display XML. gEdit and Geany - I believe - both
>>> support similar display modes. And there are some plugins for Eclipse
>>> that might handle what you need as well.
>>>
>>> --Greg
>>>
>>> On Thu, Sep 20, 2012 at 4:19 PM, Karl Kleinpaste <karl at kleinpaste.org>
>>> wrote:
>>>
>>>> Andrew Thule <thulester at gmail.com> writes:
>>>>
>>>>> One of my least favourite things is finding mismatched tags in OSIS.xml
>>>>> files.
>>>>> Has anyone successfully climbed this summit?
>>>>>
>>>> XEmacs and xml-mode (and font-lock-mode).  M-C-f and M-C-b execute
>>>> sgml-forward-element and -backward-.  That is, sitting at the beginning
>>>> of <tag>, M-C-f (meta-control-f) moves forward to the matching </tag>,
>>>> properly handling nested tags.
>>>>
>>>> _______________________________________________
>>>> sword-devel mailing list: sword-devel at crosswire.org
>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>>> Instructions to unsubscribe/change your settings at above page