[sword-devel] ESV encoding bug

DM Smith dmsmith555 at yahoo.com
Fri May 5 18:55:57 MST 2006


A couple of bugs, possible bugs and stylistic variations. (All use
Psalm 1.1 as an example. Numbered, but in no particular order.)

1) XML defines most control characters as being errors. In parsing  
ESV with Apache's Xerces, I am getting the following error in Psalm 1.1:

Error on line 1: An invalid XML character (Unicode: 0x10) was found  
in the element content of the document.

Should this be fixed? I'm pretty sure that the SWORD API
handles this as whitespace, but it causes JSword to not show the
verse. (I can change JSword to filter these characters, but that is
an awfully expensive way to clean up what is not supposed to be there.)
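
For reference, the filtering mentioned in 1) could look roughly like the
sketch below. It is not JSword code; the class and method names are
made up for illustration. It checks each code point against the Char
production of XML 1.0 and replaces anything outside it with a space, on
the assumption (unverified, see above) that treating the character as
whitespace matches what the SWORD API does.

```java
/** Hypothetical helper, not part of JSword: replaces characters that XML 1.0 forbids. */
final class XmlCharFilter {
    private XmlCharFilter() {
    }

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces each invalid code point with a single space. */
    static String replaceInvalid(String text) {
        StringBuilder out = null; // allocated only once an invalid character is seen
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (!isValidXmlChar(cp)) {
                if (out == null) {
                    out = new StringBuilder(text.length());
                    out.append(text, 0, i); // copy the clean prefix
                }
                out.append(' ');
            } else if (out != null) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        // Clean input is returned as-is, so the common case costs one scan and no copy.
        return out == null ? text : out.toString();
    }
}
```

Calling XmlCharFilter.replaceInvalid(verseText) before handing the text
to the parser would avoid the Xerces error, but every verse still pays
for the scan, which is the expense I would rather avoid by fixing the
module.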

2) Which version of OSIS defines type="i" for the <hi> element? I
presume that it stands for italics. If so, the attribute value should
be "italic".

3) Same question regarding <title type="section">. Since this is a
new module, shouldn't we adhere to the current standard? "section" is
not one of the pre-defined types for a title, so it should be
prefixed with x- (i.e. type="x-section").

4) <milestone type="line"/> According to the OSIS manual this is to  
be used to mark a line in the original that is to be preserved, but  
is not to be a part of the general presentation of the document. The  
element <lb/> is defined as being allowed anywhere and it is a line  
break that should be presented. In earlier versions of OSIS the <lb/>  
element was limited to the poetic elements.

5) <note type="crossReference" osisID="Ps.1.1.xref_b" n="b"> The OSIS  
manual recommends a different representation of the osisID on a note.  
First, in the examples, it uses ! to separate the "Ps.1.1" and
"xref_b". It also suggests that, in the absence of the n attribute
(which it recommends against using), the note marker can be
deduced from the osisID. It is not at all clear
how one could do this, but the examples in the manual all use .  
rather than _ to separate the "marker" from what precedes it. The  
manual also recommends that all notes have an osisRef to the verse to  
which it refers, allowing for notes to be extracted but still be  
"attached" to their verse.

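As a guess at how the marker might be deduced in 5), here is a small
sketch. It assumes an osisID of the form "Ps.1.1!xref.b", which I have
put together from the two conventions described above (! before the
extension, . before the marker); the manual does not spell out this
rule, and the class name is made up.

```java
/** Hypothetical parser for a note osisID such as "Ps.1.1!xref.b". */
final class OsisNoteId {
    final String verse;  // part before '!', e.g. "Ps.1.1"
    final String marker; // last '.'-separated piece after '!', or null if there is none

    private OsisNoteId(String verse, String marker) {
        this.verse = verse;
        this.marker = marker;
    }

    static OsisNoteId parse(String osisID) {
        int bang = osisID.indexOf('!');
        if (bang < 0) {
            // No '!' extension (as in the ESV's "Ps.1.1.xref_b"): nothing to deduce.
            return new OsisNoteId(osisID, null);
        }
        String extension = osisID.substring(bang + 1);
        // Assumption: the marker is whatever follows the last '.' in the extension.
        String marker = extension.substring(extension.lastIndexOf('.') + 1);
        return new OsisNoteId(osisID.substring(0, bang), marker.isEmpty() ? null : marker);
    }
}
```

With the ESV's current form the parser finds no marker at all, so a
front-end would have to fall back on the n attribute.
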
6) Notes are preceded by whitespace and immediately followed by
content. If note markers are rendered where the notes occur, this would
indicate that the note refers to what follows (i.e. to the text it is
adjacent to). I am not sure this is what is intended.

7) Line breaks (i.e. <milestone type="line"/>) are immediately
followed by whitespace in some cases. This may cause subtle
whitespace issues. It would be better for the whitespace to precede
the break or to be eliminated altogether.

8) There is an instance of multiple whitespace characters between two
words. In JSword and Sword WEB, as well as any other HTML-based system,
this won't be a problem. But I don't know about other front-ends.

9) There is a missing space between "seat" and "scoffers". It is
encoded as "seat<note...>....</note>scoffers". I don't think we should
have to guess where whitespace belongs relative to a note. If the
note is hidden, it forces the words together.
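
To illustrate 9), the sketch below hides notes by simply deleting them
from the markup. The regex is a deliberately naive stand-in for what a
front-end does, the class name is made up, and the note attributes and
body in the usage example are placeholders, not the actual ESV content.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration: hiding notes by removing them from the markup. */
final class NoteHider {
    // Naive match of a whole <note ...>...</note> element; assumes notes are not nested.
    private static final Pattern NOTE =
        Pattern.compile("<note\\b[^>]*>.*?</note>", Pattern.DOTALL);

    private NoteHider() {
    }

    static String hideNotes(String markup) {
        return NOTE.matcher(markup).replaceAll("");
    }
}
```

NoteHider.hideNotes("seat<note n=\"b\">placeholder</note>scoffers")
gives "seatscoffers". With a space encoded on either side of the note,
the words stay separated without the front-end having to guess.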



