[sword-devel] Are individual verses in a module "well formed"

DM Smith dmsmith555 at yahoo.com
Wed Apr 20 13:33:13 MST 2005


There are various contexts in which it makes sense to display individual 
verses, e.g. quotations. Search also returns individual verses, and 
adjacent verses may be merged into a passage before being presented to 
the user. Different clients do it differently. For these to be handled 
without error, either we need more information (e.g. the boundaries of 
what constitutes a well-formed segment) or the code needs to do 
sophisticated error recovery.

When an individual verse is expanded to include other verses, the client 
needs to know that what was asked for and what was returned are 
different, and how they differ, so that it can handle the result well 
and communicate it to the user in a useful fashion (e.g. show the 
context and highlight the hit).

The approach I was thinking about was to analyze any verse that failed 
to validate for unmatched tags. If an unmatched end tag was found, then 
prepend the previous verse and try again, repeating until success or 
until some defined threshold of pain was reached. Similarly, for a 
missing end tag, append the following verses instead. If both begin and 
end tags are missing, then grow each side until the problem is solved or 
it is too painful.
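
A minimal sketch of that grow-until-balanced idea, in Python for brevity. 
The names unmatched, expand_until_balanced and max_rounds are made up for 
illustration, a regular-expression scan stands in for a real parser, and 
verses are assumed to be a plain list of markup strings:

```python
import re

# Matches start, end and empty tags; captures the "/" markers and the name.
TAG = re.compile(r"<(/?)([A-Za-z][\w:.-]*)[^<>]*?(/?)>")


def unmatched(text):
    """Return (stray_end_tags, unclosed_start_tags) for a markup fragment."""
    open_stack, stray_ends = [], []
    for m in TAG.finditer(text):
        closing, name, empty = m.groups()
        if empty:                      # <tag/> opens and closes itself
            continue
        if not closing:
            open_stack.append(name)
        elif open_stack and open_stack[-1] == name:
            open_stack.pop()
        else:
            stray_ends.append(name)    # end tag whose start lies earlier
    return stray_ends, open_stack


def expand_until_balanced(verses, index, max_rounds=5):
    """Grow [start, end] around verses[index] until the tags balance.

    Returns (start, end), or None when the pain threshold (max_rounds)
    is reached or no further growth is possible.
    """
    start = end = index
    for round_no in range(max_rounds + 1):
        stray_ends, unclosed = unmatched("".join(verses[start:end + 1]))
        if not stray_ends and not unclosed:
            return start, end
        if round_no == max_rounds:
            break
        grew = False
        if stray_ends and start > 0:   # missing begin tag: prepend a verse
            start -= 1
            grew = True
        if unclosed and end < len(verses) - 1:  # missing end tag: append
            end += 1
            grew = True
        if not grew:
            break
    return None
```

A result of None is the "too painful" case, where the caller has to fall 
back to some other recovery.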

I mention the problem of pain because in the case of the KJV, the notes 
are badly encoded and this technique won't work. For example, 
<note/>...</note> has a missing begin tag for which none will ever be 
found, because what was supposed to be the begin tag was encoded as an 
empty tag.

I also thought about adding artificial begin or end tags to the segment. 
In this case the text needs to be analyzed to determine what the invalid 
tags are. I think this is more difficult as some tags have 
required/expected attributes. While it may pass the well-formed test, it 
may be invalid and that may cause it to be rendered badly.
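
To make the trade-off concrete, here is a rough sketch of synthesizing 
the missing tags. The function name add_artificial_tags is hypothetical, 
and the sketch assumes the fragment was cut from a document that is 
well-formed as a whole. Note that the artificial begin tags carry no 
attributes, which is exactly the weakness described above:

```python
import re

# Matches start, end and empty tags; captures the "/" markers and the name.
TAG = re.compile(r"<(/?)([A-Za-z][\w:.-]*)[^<>]*?(/?)>")


def add_artificial_tags(fragment):
    """Wrap a fragment with bare begin/end tags so that it balances."""
    unclosed, stray_ends = [], []
    for closing, name, empty in TAG.findall(fragment):
        if empty:
            continue                   # <tag/> needs no partner
        if not closing:
            unclosed.append(name)
        elif unclosed and unclosed[-1] == name:
            unclosed.pop()
        else:
            stray_ends.append(name)
    # The first stray end tag closes the innermost element, so the
    # artificial begin tags are emitted outermost first.
    prefix = "".join(f"<{name}>" for name in reversed(stray_ends))
    suffix = "".join(f"</{name}>" for name in reversed(unclosed))
    return prefix + fragment + suffix


# The real begin tag may have been <q who="Jesus">; the artificial one
# cannot know that.
print(add_artificial_tags("Your kingdom come.</q>"))
```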

JSword's current technique is to gradually strip out stuff (bad 
characters, reserved characters and finally all xml stuff) and 
ultimately be left with something that frequently looks very bad.
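
As an illustration only (this is not JSword's actual code, and 
degrade_until_parseable and the stage labels are invented), a staged 
fallback of this kind might look like the following, using Python's 
standard XML parser as the well-formedness test:

```python
import re
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """True if the fragment parses once wrapped in a dummy root element."""
    try:
        ET.fromstring("<root>" + fragment + "</root>")
        return True
    except ET.ParseError:
        return False


# Each stage is applied on top of the previous ones, mildest first.
STAGES = [
    ("as-is", lambda s: s),
    ("drop control characters",
     lambda s: re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", s)),
    ("escape bare ampersands",
     lambda s: re.sub(r"&(?!#?\w+;)", "&amp;", s)),
    ("strip all tags",
     lambda s: re.sub(r"<[^<>]*>", "", s).replace("<", "&lt;")),
]


def degrade_until_parseable(fragment):
    """Return (stage_label, text) for the first stage that parses."""
    text = fragment
    for label, step in STAGES:
        text = step(text)
        if is_well_formed(text):
            return label, text
    return "gave up", text             # e.g. an undefined entity remains
```

The last stage is the one that produces the bad-looking output: all 
structure is gone and only the bare text survives.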

What is going through my head as a possibly workable solution is to 
create another index for well-formed boundaries. The basic idea is that 
for every verse that is not well formed, a start and end verse would be 
given that together form a well-formed span. Essentially a map of book 
structure (book/section/paragraph/list/table/poetry...) from a verse 
perspective. In some cases, like poetry, it may be best to use the 
entire poem as the context rather than the smallest well-formed context 
in which the verse resides, as it may render better. In that case the 
verse may even be well formed itself, but sit in a context that should 
be rendered as a whole. Personally, I would try to do this in Lucene, 
with the verse being indexed and the start and end stored with the 
verse's doc.
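
A small sketch of how a client might consult such an index. A plain 
dictionary stands in for the Lucene stored fields here, and the names 
(context_for, boundaries, order) and the sample entries are invented for 
illustration. The result also records which verse was the one actually 
asked for, so the client can highlight the hit within its context:

```python
def context_for(verse, order, boundaries):
    """Return [(verse_id, is_hit), ...] covering the rendering context.

    order      -- all verse ids of the module, in canonical order
    boundaries -- verse id -> (first id, last id) of its context; verses
                  that are fine on their own are simply absent
    """
    first, last = boundaries.get(verse, (verse, verse))
    i, j = order.index(first), order.index(last)
    return [(v, v == verse) for v in order[i:j + 1]]


# Hypothetical sample data: one verse that needs its neighbours.
order = ["Mat.6.8", "Mat.6.9", "Mat.6.10", "Mat.6.11"]
boundaries = {"Mat.6.10": ("Mat.6.9", "Mat.6.11")}

for verse_id, is_hit in context_for("Mat.6.10", order, boundaries):
    print(("* " if is_hit else "  ") + verse_id)
```

Because the index stores a context rather than a proof of 
well-formedness, the same lookup covers the poetry case, where a 
well-formed verse is still mapped to the whole poem.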

Note, this is not just an XML problem. GBF has the notion of matched 
begin and end tags and these may be in different verses.

Troy A. Griffitts wrote:

> DM,
>     No, SWORD currently does no work to promise any retrievable 
> segment of text as valid markup.  I have talked with a few XML experts 
> and have had a number of ideas brewing for the past few years how we 
> might offer such information, as it is a necessary obstacle to overcome.
>
>     The question, more generally, really is:
>
>     How can one package and send a segment of an XML document.  Steve 
> DeRose has pointed me to at least one project/standard which tries to 
> address this issue.  I need to review my email archives and study 
> their solution.  My ideas, very generally are either:
>
> With each retrieved segment of text from the API, provide a context 
> tag stack object which described the tag context at the start of the 
> segment.
>
> or
>
> Do the actual work of returning valid XML for a segment of text, and 
> provide an attribute in all supplied markup to designate it as such:
>
> <verse osisID="Mat.6.10"><q who="Jesus" level="1" sID="Mat.5.3.q1" 
> misc="phantom" /><q who="Jesus" level="2" sID="Mat.6.9.q1" 
> misc="phantom" />Your kingdom come. Your will be done, On earth as it 
> is in heaven<q eID="Mat.6.9.q1" misc="phantom" /><q eID="Mat.5.3.q1" 
> misc="phantom" /></verse>
>
> Note that this last example doesn't really supply any tags REQUIRED 
> FOR XML VALIDITY, but does provide the more important tags required 
> to represent the text correctly.  And also note that any 'phantom' 
> TRUE END TAGS will not be identifiable, as we cannot supply an 
> attribute.
>
> I think the first option works best for our engine design.  When a 
> client iterates a chapter, making 1 call for each verse, they aren't 
> concerned with valid XML for each verse, but rather, they want any 
> context when they start the segment (chapter in our example) and then 
> they may want to close any remaining open tags when done rendering.
>
> But it's all still just rumbling around in my mind, so any ideas are 
> very welcome.
>
>     -Troy.
>
>  DM Smith wrote:
>
>> I asked this earlier on another thread, but it was lost in the noise 
>> of that thread.
>>
>> Does Sword, in making a module, ensure (or try to ensure) that each 
>> verse is well formed? That is, for every begin feature marker, there 
>> is a corresponding end feature marker. In XML and ThML it would be a 
>> <tag>...</tag> or <tag/> but in gbf it might be a matched pair <TAG> 
>> <tAG>.
>>
>> If not, is there some boundary (e.g. chapter) that is guaranteed to 
>> be a well-formed unit?
>> And any suggestions on how to manage individual verses that are not 
>> well formed?
>
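
Troy's first option above, a context tag stack supplied with each 
segment, could be sketched roughly as follows. This is only an 
illustration: open_elements and render_segment are hypothetical names, 
not part of the SWORD API, and the scan handles container elements 
only, not OSIS milestones such as <q sID="..."/>. Unlike artificial 
tags, the stack keeps the original start tags with their attributes:

```python
import re

# Matches start, end and empty tags; captures the "/" markers and the name.
TAG = re.compile(r"<(/?)([A-Za-z][\w:.-]*)[^<>]*?(/?)>")


def open_elements(text):
    """Return [(name, full_start_tag), ...] still open at the end of text."""
    stack = []
    for m in TAG.finditer(text):
        closing, name, empty = m.groups()
        if empty:
            continue                   # <tag/> never stays open
        if not closing:
            stack.append((name, m.group(0)))
        elif stack and stack[-1][0] == name:
            stack.pop()
    return stack


def render_segment(preceding, segment):
    """Re-open the context before the segment and close what is left open."""
    before = open_elements(preceding)
    after = open_elements(preceding + segment)
    prefix = "".join(tag for _name, tag in before)
    suffix = "".join(f"</{name}>" for name, _tag in reversed(after))
    return prefix + segment + suffix


preceding = '<q who="Jesus">Our Father in heaven, '
print(render_segment(preceding, "Your kingdom come."))
```

A client iterating a whole chapter would only need the stack once, at 
the start, and the closing tags once, at the end.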

