[sword-devel] Bible Chapter Titles?

Tue Jun 17 04:46:46 MST 2008

On Jun 17, 2008, at 1:26 AM, Greg Hellings wrote:

> On Mon, Jun 16, 2008 at 9:42 PM, DM Smith <dmsmith555 at yahoo.com>  
> wrote:
>>
>> On Jun 16, 2008, at 9:23 PM, Greg Hellings wrote:
>>
>>> I'm looking through the mod2osis.cpp file, trying to bring its
>>> output closer into the form of the module inputs (basing it off of
>>> the result of running the tool as compared to the KJV input files).
>>> So far I seem to have the following problems - I can't seem to find
>>> where (or if) the following information is maintained and retrieved
>>> from the Sword API:
>>
>> I don't think mod2osis has been kept current with the changes to osis
>> nor with osis2mod.
>>
>> mod2osis, if I understand, will also create osis output for
>> plaintext, gbf and ThML modules. I don't think these filters are
>> robust.
>
> Right now, all of the problems appear to be on the mod2osis side,
> since the module that I'm working from was an OSIS source.  However,
> I've only been hammering away at the first few discrepancies.  So far
> the most common discrepancies that I have encountered are inverted
> order of the morph= and lemma= attributes when they occur on a <w ....>
> tag as well as switching up the order of such attributes as type="x-p"
> marker="¶" (sometimes with a subType="x-added" also) on the
> <milestone...> element.
>
> The order of attributes is something beyond the scope of mod2osis
> and needs to be updated/changed in the filters themselves.

Order of attributes is not significant in XML. Every XML processor is
free to rearrange attributes as it sees fit. It is also permissible for
an XML processor to remove non-required attributes that match their
declared default, or to add such attributes with their default value if
they were missing.
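
To illustrate, here is a minimal sketch of comparing two start tags
while ignoring attribute order. It is not part of SWORD; the function
names are made up for this example, the lemma/morph values are only
sample data, and the parser is deliberately naive (double-quoted values
only, no entity handling).

```cpp
#include <cassert>
#include <map>
#include <string>

typedef std::map<std::string, std::string> AttrMap;

// Element name of a start tag, e.g. "w" for <w lemma="...">.
std::string elementName(const std::string& tag) {
    std::size_t start = tag.find('<');
    start = (start == std::string::npos) ? 0 : start + 1;
    std::size_t end = tag.find_first_of(" \t\r\n/>", start);
    return tag.substr(start, end == std::string::npos ? end : end - start);
}

// Collect name="value" pairs of a single start tag into a map.
// A map has no notion of source order, which is the point here.
AttrMap parseAttributes(const std::string& tag) {
    const char* ws = " \t\r\n";
    AttrMap attrs;
    std::size_t pos = tag.find_first_of(ws);  // first whitespace ends the name
    while (pos != std::string::npos) {
        std::size_t nameStart = tag.find_first_not_of(ws, pos);
        std::size_t eq = tag.find('=', pos);
        if (nameStart == std::string::npos || eq == std::string::npos ||
            nameStart >= eq) break;
        std::size_t open = tag.find('"', eq);
        if (open == std::string::npos) break;
        std::size_t close = tag.find('"', open + 1);
        if (close == std::string::npos) break;
        std::size_t nameEnd = tag.find_last_not_of(ws, eq - 1);
        attrs[tag.substr(nameStart, nameEnd - nameStart + 1)] =
            tag.substr(open + 1, close - open - 1);
        pos = close + 1;  // continue after the closing quote
    }
    return attrs;
}

// Two start tags match when the element name and the attribute
// name/value pairs match, regardless of attribute order.
bool sameStartTag(const std::string& a, const std::string& b) {
    return elementName(a) == elementName(b) &&
           parseAttributes(a) == parseAttributes(b);
}

// Small self-check; call it from main().
void checkAttributeOrderIgnored() {
    assert(sameStartTag(
        "<w lemma=\"strong:H07225\" morph=\"strongMorph:TH8804\">",
        "<w morph=\"strongMorph:TH8804\" lemma=\"strong:H07225\">"));
    assert(!sameStartTag("<w lemma=\"strong:H07225\">",
                         "<w lemma=\"strong:H00430\">"));
}
```

A comparison along these lines treats lemma-morph and morph-lemma as
the same tag, so no reordering step is needed before checking the
output.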

>  Right now
> I'm running a basic python script on the output of mod2osis to
> manually reorder those, since I don't believe that the XML will really
> be affected by that (and also because I have combed through the OSIS
> filters and cannot figure out how to make that order change - anyone
> know how to do that?  Currently the order is lemma-morph and it needs
> to be morph-lemma as well as the x-p things need to be type-marker
> instead of marker-type).

What requires the order? That program needs to be re-written to not  
require it.

>
>
> I consider those to be trivial changes which don't affect the actual
> functioning of the tool, versus the fact that it was producing invalid
> osisID attributes for chapters and books (a problem which was
> relatively simple to work out).
>
>>
>> Since you are talking about being able to round trip a module created
>> with osis2mod, I'll mention what it does.
>>
>>>
>>>
>>> 1) Where is the equivalent information from the OSIS block below
>>> maintained?  Is it maintained?
>>
>> osis2mod takes an xml file which is presumed to be valid OSIS and
>> based upon that assumption, looks for testament, book, chapter and
>> verse content.
>>
>> It ignores everything in the header element.
>>
>>
>>> There is brief mention of Strongs data
>>> and such in the .conf file, but is that enough to go off of to
>>> recreate this information in general?
>>
>> There is not quite enough info in the conf to recreate the header.
>> Specifically, there are several variants of the work prefix for
>> Strong's numbers and for morphology. Without digging into the module,
>> it is not possible to know what the work ids are. It is possible for
>> us to have a generic header that encodes all the possibilities.
>>
>> Also, the conf does not encode the scope of the work, which is a
>> typical part of the header. To get it exact, one would have to dig
>> into the module.
>
> These are things which an XSLT could remedy.  The XSLT could produce a
> .conf from the OSIS document that does include those things and has
> blank lines on the other absolutely necessary .conf entries.  A module
> maintainer/creator could run the XSLT to auto-create the .conf file
> and then manually fill in the additional fields which are not normally
> part of the OSIS file (or which were missing from the OSIS file).  If
> we do that, then we can preserve this information for mod2osis to
> recreate.
>
>>
>>
>>> Perhaps this information should
>>> be part of a standard .xsl file which we include in tools available
>>> for module creators to run.  Have it output a basic .conf file with
>>> the information from the OSIS document and preserve information like
>>> this in it somewhere?
>>>
>>> <   <work osisWork="strong">
>>> <     <refSystem>Dict.Strongs</refSystem>
>>> <   </work>
>>> <   <work osisWork="robinson">
>>> <     <refSystem>Dict.Robinsons</refSystem>
>>> <   </work>
>>> <   <work osisWork="strongMorph">
>>> <     <refSystem>Dict.strongMorph</refSystem>
>>> <   </work>
>>>
>>>
>>> 2. Chapter titles?
>>> How do you test for the presence of a chapter title?
>>
>> There are testament, book and chapter titles. These have special
>> notations using 0 as the index.
>>
>> For example John 1:0 is the chapter title for chapter 1 and John 0:0
>> is the book title.
>>
>> In osis2mod, the content of these is determined by the placement of
>> the text. To simplify: If it stands after the opening of a book but
>> before the opening of a chapter, then it is a book title. If it
>> stands after the opening of a chapter, but before the beginning of a
>> verse, it is a chapter title.
>
> This is the least cumbersome way I can figure out to try and access
> this - however, it seems to be having some issues (which I added to
> mod2osis, starting right after the sprintf call on line 165 or so,
> that produces the <div type="book" ...> tag):
> [code]
> char* name = new char(100);
> strcpy(name, tmpKey.getOSISBookName());
> name = strcat(name, "0:0");
> inModule->setKey(new VerseKey(name));
> SWBuf title = inModule->getRawEntry();
> inModule->setKey(tmpKey);
> if (strlen(title.c_str()) > 0)
>     sprintf(buf, "\t<title type=\"main\">%s</title>\n", title.c_str());
> [/code]
> That is my attempt to grab the book title and print it out.  However,
> what I'm getting out is the title tag surrounding the OSIS output of
> chapter 1, verse 1 of the book, instead of the title.  Then, the
> intrigue mounts as, just a few lines later, the program segfaults on
> this line:
> [code]
> if ((vkey->Chapter() != lastChap) || newBook) {
> [/code]
>
> Does anyone else have a less cumbersome way of doing this or, more
> importantly, know how to work that so that it does not segfault at the
> next block of code?

Ahh, this is not Java, so I cannot "readily" help :)
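
One thing I can say from reading it: new char(100) allocates a single
char initialized to the value 100, not a 100-byte buffer (that would be
new char[100]). The strcpy and strcat calls then write past that one
byte, which could well corrupt memory and show up as a crash a few
lines later. I have not traced whether that is what trips the later
vkey->Chapter() line. It may also be that the key needs headings turned
on before verse 0 is honored rather than normalized to 1:1; I have not
checked that either.

A minimal sketch that avoids the manual buffer by building the text
with std::string. Both helper names are hypothetical, the SWORD calls
are left out so the sketch stands alone, and the space before "0:0" is
my assumption about what the key parser expects:

```cpp
#include <cassert>
#include <string>

// Build the key text for a book-level title, e.g. "Gen 0:0".
// bookTitleKey is a hypothetical helper, not part of the SWORD API.
// Assumption: the key parser wants a space between book and "0:0".
std::string bookTitleKey(const char* osisBookName) {
    assert(osisBookName != nullptr);
    std::string key(osisBookName);
    key += " 0:0";
    return key;
}

// Format the <title> line, or return an empty string when the
// entry has no content so that nothing is written for it.
std::string formatMainTitle(const std::string& title) {
    if (title.empty()) return std::string();
    return "\t<title type=\"main\">" + title + "</title>\n";
}
```

The result of bookTitleKey(tmpKey.getOSISBookName()).c_str() could then
be handed to the VerseKey constructor in place of the raw buffer.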

>
>
>>
>> We can also have titles that are between verses. These are pre-pended
>> to the verse content and marked as pre-verse.
>
> It sounds like those are irrecoverable as titles, then, with that type
> of setup, or did I misunderstand you?
>
>>
>>
>>> In the following
>>> block, the chapter title itself is easy enough to recreate but at
>>> the expense of portability to someone else who wants to give
>>> chapterTitle="The E Creation Tale" or some such thing, but I can't
>>> find access to the information maintained in the <title...> tag.  Is
>>> this information maintained, and if so, how is it accessed?
>>
>> The only thing that is maintained is the actual content of the verse,
>> chapters, books, ..., but not of those elements themselves.
>
> In the case of the KJV module that you've created, the content of the
> chapterTitle= attribute on the chapters is identical to the content of
> the <title...> element that immediately follows it, at least near the
> beginning of Genesis.

This is a bit of a tug-of-war between the OSIS spec and what we  
actually do in osis2mod. The OSIS spec gives 2 ways to encode a title.  
The KJV OSIS uses both, but osis2mod ignores the attribute.

>  It appears that, if we aren't going to be
> utilizing the chapterTitle= attribute, then we can afford to lose
> track of it in the *2mod->mod2osis trip.

True. I think your goal should be the following transformation:

osis module -> osis xml -> osis module -> osis xml
such that the osis modules are identical and the xml files are  
identical.
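
A crude way to check that goal is a byte-for-byte comparison of the two
XML files, and likewise of the module files. A minimal sketch; the
function names are made up for this example and nothing here comes from
the SWORD code base:

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// True when both streams yield exactly the same bytes.
bool sameBytes(std::istream& a, std::istream& b) {
    std::istreambuf_iterator<char> ia(a), ib(b), end;
    for (; ia != end && ib != end; ++ia, ++ib) {
        if (*ia != *ib) return false;
    }
    return ia == end && ib == end;  // both must run out together
}

// True when both files can be opened and have identical contents.
bool sameFile(const std::string& pathA, const std::string& pathB) {
    std::ifstream a(pathA.c_str(), std::ios::binary);
    std::ifstream b(pathB.c_str(), std::ios::binary);
    return a && b && sameBytes(a, b);
}

// Small self-check using in-memory streams instead of real files.
void checkSameBytes() {
    std::istringstream x("<osis/>"), y("<osis/>");
    assert(sameBytes(x, y));
    std::istringstream x2("<osis/>"), z("<osis />");
    assert(!sameBytes(x2, z));
}
```

Since both XML files would come out of the same exporter, attribute
order should be the same in each and should not get in the way of such
a comparison.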

>
>
>>
>>> It seems
>>> like it would be useful to have, as many Bible editors insert
>>> information like this into the flow of the text.
>>>
>>> < <title type="main">THE FIRST BOOK OF MOSES CALLED GENESIS</title>
>>> < <chapter osisID="Gen.1" chapterTitle="CHAPTER 1.">
>>>
>>>
>>> 3. Milestoneable verse boundaries?
>>> It doesn't seem that mod2osis has any support for milestone verse
>>> tags, is this correct?
>>
>> I'm not sure I understand. The module contains no notion of verse
>> tags, milestoned or otherwise. In reconstructing the module, it is
>> important to know as one outputs the content of a verse whether it is
>> well-formed, in and of itself, or not. And since OSIS requires that
>> if the milestoned form is used in one location, it is used
>> consistently everywhere, the only safe output from mod2osis for a
>> verse tag is milestoned.
>>
>>> How would one programmatically detect this, as
>>> well as other milestone elements?  Somewhere, though, it's producing
>>> output like this:
>>> <milestone type="x-extra-p"/>
>>> Is that coming from the markup filter?  That's the only
>>> explanation I can find for it.  However, I'm not sure that there's
>>> an example of milestone-support in the KJV document which can be
>>> used for testing that support.
>>
>> osis2mod in order to construct well-formed verses takes the <p>
>> element (which is the only container element in OSIS that cannot be
>> milestoned) and replaces it with <lb type="x-paragraph-begin"/> and
>> <lb type="x-paragraph-end"/> (I am doing this from memory, so the
>> attribute value might be a bit different.)
>
>
> Currently the KJV has the <verse...> *some text* </verse> syntax,
> which is maintained by mod2osis.  However, it does use <milestone.../>
> for some things (currently the most prevalent appears to be
> type="x-p", to the point that I haven't encountered any others, though
> I haven't gotten very far into the text yet).  It seems safe, at least
> for now, that, if we're going to only accept <verse>...</verse> syntax
> and not allow the <p>...</p> syntax, it's not a problem.  However, I
> thought that the purpose was to force people to use <p>...</p>, which
> can often break the <verse>...</verse> syntax, due to editorial
> choices.  Why have we gone the exact opposite way?

This would make for a good separate thread, but let me see if I can  
summarize.

Verse numbers were added late in Christian history (the numbering in
use today dates from the 16th century), and even chapters and
paragraphs are not original. In the original Greek manuscripts, even
lower case letters, spaces, diacritics and punctuation were absent.

Some argue that the proper OSIS structure is that of a document upon  
which verses are imposed. This would be Book, Chapter, Section, and  
Paragraph.

Others would argue that this is secondary to the ability of software  
to process the document in a manner that users want to use the Bible.

Most of our applications require a verse to be well-formed and  
meaningful in isolation (such as a search result list; parallel view).  
This is especially true of applications that render HTML.

Most users still want to see verse numbers and think of the Bible as  
being structured by verses.

The osis2mod process is tasked with taking all valid input and
creating a module that works for all SWORD engine processes. It will
transform the input as needed into structures which may not be
particularly good OSIS, but will be valid OSIS. As a process, osis2mod
is not able to handle "all" valid input in this fashion. Our coding is
reactive: it is sufficient for what we have encountered so far.

I fall into the camp that believes that the verse is the key  
structure. Since I wrote the KJV OSIS file, and improved osis2mod, you  
will see that I did not structure the file by paragraph.

Also, the KJV is not structured by paragraph. In the KJV, some books  
have paragraph markers. The rest don't. Also in the KJV, there are no  
quote marks, but there are quotes.

>
>>
>> Hope that helps.
>>>
>>>
>>> I'll pass along other questions as I see them.
>>>
>>
>> Looking forward to them.
>>
>> You might want to look at JSword's
>> org.crosswire.jsword.examples.BibleToOSIS that I used to re-create
>> the KJV OSIS from the module when I was working on the current
>> version of the KJV module. Currently, it just wraps the raw text,
>> with minor modifications to produce the module. However, with a
>> simple change this can be tied to very robust filters for GBF,
>> PlainText, ThML and TEI.
>>
>> In Him,
>>       DM
>>
>>
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
>>