[sword-devel] ESV encoding bug

DM Smith dmsmith555 at yahoo.com
Sat May 6 07:20:44 MST 2006

I did some testing using the Web tool and the latest BibleCS release.

On May 5, 2006, at 9:55 PM, DM Smith wrote:

> A couple of bugs, possible bugs and stylistic variation: (All using  
> Psalm 1.1 as an example. Numbered but no particular order.)
> 1) XML defines most control characters as being errors. In parsing  
> ESV with Apache's Xerces, I am getting the following error in Psalm  
> 1.1:
> Error on line 1: An invalid XML character (Unicode: 0x10) was found  
> in the element content of the document.
> Question, should this be fixed? I'm pretty sure that the SWORD API  
> handles this as whitespace. But it causes JSword to not show the  
> verse (I can change JSword to filter these characters, but that is  
> awfully expensive to clean up what is not supposed to be there).

Web tool strips out the "offending" character because it handles xref  
notes differently than BibleCS.
BibleCS shows a box for the bad character.
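
For reference, the filtering mentioned in item 1 could look roughly like the
sketch below. It is only an illustration in Python, not code from JSword or
the SWORD API. The function name and the sample string are made up, and
replacing the character with a space is an assumption based on the guess
above that SWORD treats it as whitespace.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text, replacement=" "):
    """Replace characters that XML 1.0 forbids (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


# Made-up sample containing the 0x10 control character, not the real verse.
raw = "abc\x10def"
cleaned = strip_invalid_xml_chars(raw)

# After filtering, the text can be embedded in XML and parsed.
parsed = ET.fromstring("<verse>" + cleaned + "</verse>")
```

Doing this on every verse at read time is the expense item 1 complains
about; running it once when the module is built would avoid that.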

> 2) Which version of OSIS defines type="i" for the <hi> element? I  
> presume that it stands for italics. If so, the attribute value should be "italic".

Not a bug.

> 3) Same question regarding <title type="section">. Since this is a  
> new module shouldn't we adhere to the current standard? Section is  
> not one of the pre-defined types for a title. So it should be  
> preceded by x-.

Not a bug.

> 4) <milestone type="line"/> According to the OSIS manual this is to  
> be used to mark a line in the original that is to be preserved, but  
> is not to be a part of the general presentation of the document.  
> The element <lb/> is defined as being allowed anywhere and it is a  
> line break that should be presented. In earlier versions of OSIS  
> the <lb/> element was limited to the poetic elements.

Not a bug.

> 5) <note type="crossReference" osisID="Ps.1.1.xref_b" n="b"> The  
> OSIS manual recommends a different representation of the osisID on  
> a note. First, in the examples, it uses ! to separate the "Ps.1.1"  
> and "xref_b". It also suggests that failing the presence of the n  
> attribute (which the manual recommends against using) that  
> the note marker can be deduced from the osisID. It is not at all  
> clear how one could do this, but the examples in the manual all  
> use . rather than _ to separate the "marker" from what precedes it.  
> The manual also recommends that all notes have an osisRef to the  
> verse to which it refers, allowing for notes to be extracted but  
> still be "attached" to their verse.

Not a bug.
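
To make item 5 concrete, here is a small Python sketch of the form described
there: "!" between the verse and the note part, "." before the marker, and an
osisRef pointing back at the verse. The helper names and the "xref" label are
assumptions for illustration, and the exact identifier layout should be
checked against the OSIS manual.

```python
def note_attributes(verse_id, marker, kind="xref"):
    """Build osisID/osisRef for a note (hypothetical helper)."""
    return {
        "osisID": f"{verse_id}!{kind}.{marker}",
        "osisRef": verse_id,
    }


def marker_from_osis_id(osis_id):
    """One possible way to deduce the marker: take what follows the last '.'."""
    return osis_id.rsplit(".", 1)[1]


attrs = note_attributes("Ps.1.1", "b")
marker = marker_from_osis_id(attrs["osisID"])
```

With the "_" separator used in the module ("Ps.1.1.xref_b") the same
deduction would return "xref_b" rather than "b".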

> 6) Notes are preceded by whitespace and immediately followed by  
> content. If note markers are placed where they occur, this would  
> indicate that the note refers to what follows (i.e. to what it is  
> adjacent). I am not sure this is what is intended.

Both the Web tool and BibleCS add extra space around the note  
markers. The extra space is visible in both. It is fairly
inconsequential, but the note marker is closer to what follows than  
what precedes.

> 7) line breaks (i.e. <milestone type="line"/>) are immediately  
> followed by whitespace in some cases. This may cause subtle  
> whitespace issues. It would be better to have whitespace precede  
> the break or be eliminated altogether.

The Web tool adds extra space before each verse and at the beginning  
of each line. So this is hidden with the browser's handling of HTML.  
If space is not added to the beginning of each verse, it would become  

BibleCS handles all whitespace as significant and this makes the left  
margin ragged.

> 8) There is an instance of multiple whitespace between two words.  
> In JSword and Sword WEB as well as any other HTML based system,  
> this won't be a problem. But, I don't know about other front-ends.

BibleCS has a huge gap.
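
HTML-based front-ends hide this because browsers collapse runs of
whitespace. A front-end that treats whitespace as significant could do the
same normalization itself. A minimal Python sketch, with a made-up function
name and sample text:

```python
import re


def collapse_whitespace(text):
    """Collapse each run of spaces, tabs and line breaks into one space."""
    return re.sub(r"[ \t\r\n]+", " ", text)


sample = "two   words"
collapsed = collapse_whitespace(sample)
```

This only hides the problem at display time; fixing the module text would
be the cleaner solution.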

> 9) There is a missing whitespace between "seat" and "scoffers". It is  
> encoded as "seat<note...>....</note>scoffers". I don't think we  
> should have to guess where whitespace belongs relative to a note.  
> If the note is hidden, it forces the words together.

Bug in both.
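
The effect in item 9 is easy to reproduce outside any front-end. The Python
sketch below hides notes by dropping every <note> element while keeping the
text that follows it. The fragment is a simplified stand-in for the real
markup, and the <verse> wrapper and function name are assumptions made for
the example.

```python
import xml.etree.ElementTree as ET


def text_without_notes(elem):
    """Return the text of elem with all <note> content left out."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != "note":
            parts.append(text_without_notes(child))
        # Text after the child's closing tag belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


as_encoded = '<verse>seat<note type="crossReference" n="b">x</note>scoffers</verse>'
with_space = '<verse>seat <note type="crossReference" n="b">x</note>scoffers</verse>'

hidden_as_encoded = text_without_notes(ET.fromstring(as_encoded))
hidden_with_space = text_without_notes(ET.fromstring(with_space))
```

With the module's encoding the two words run together once the note is
hidden; with a space before the note they stay separate.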
