[sword-devel] Entities in modules

DM Smith dmsmith at crosswire.org
Thu Nov 12 11:05:03 MST 2009

On 11/12/2009 12:08 PM, Sebastien Koechlin wrote:
> On Wed, Nov 11, 2009 at 03:50:12PM -0500, DM Smith wrote:
>> We have a few modules that have entities in them. These are of the fashion
>> &nbsp; (a character entity), &#85; (a numeric decimal entity) and &#xC5;
>> (a numeric hex entity).
>> These cause various problems:
> This is because osis2mod does not use an XML parser.

I'm not seeing the problem in OSIS modules, but in ThML modules. The 
entities are perfectly valid in ThML, but they are still problematic. I 
will be going over all the modules looking for these and will report 
problematic CrossWire modules at www.crosswire.org/bugs. And I'll pass 
along any problems I find in the Xiphos and Bible.org modules.

My understanding is that a true XML parser has strict requirements as to 
how it is to handle errors: put out an error message and die.

If we used a true XML parser for osis2mod, it would die on the first 
character entity that was not &amp;, &lt;, &gt; or &quot; unless it were 
defined in the schema. OSIS does not define additional character entities.
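
To illustrate the behavior described above, here is a minimal sketch using 
Python's standard xml.etree.ElementTree parser. It is not osis2mod code; 
the helper name try_parse and the sample fragments are made up for the 
illustration.

```python
import xml.etree.ElementTree as ET


def try_parse(fragment):
    """Return the parsed element's text, or the parser's error message."""
    try:
        return ET.fromstring(fragment).text
    except ET.ParseError as err:
        # A conforming parser treats an undefined entity as a fatal error.
        return f"error: {err}"


# A predefined entity is resolved by the parser.
print(try_parse("<p>a&amp;b</p>"))
# Numeric entities (decimal and hex) are resolved as well.
print(try_parse("<p>&#85;&#xC5;</p>"))
# An HTML-only named entity with no declaration is rejected.
print(try_parse("<p>a&nbsp;b</p>"))
```

The first two calls return text with the entities resolved; the third 
returns an error string because nothing declares nbsp.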

We make the assumption that input to osis2mod has been validated against 
the OSIS schema. If this is true then there are no character entities in 
the input.

>   A character entity is
> just a useful way to write a character you cannot or do not want to
> put in your XML file. When parsed and resolved, they must not be
> distinguishable from other characters. The same applies to CDATA sections.

I agree with the statement above as far as it goes. But what is the XML 
parser to do when it discovers a character entity that it cannot resolve?

> osis2mod should not keep entities when reading an OSIS file. I think it's a
> big mistake and we should not rely on external programs many people will
> have trouble running.

I'd agree that numeric entities should be converted. And I think that 
osis2mod should complain if it finds entities that are not valid for an 
OSIS document and prompt the user to validate the input document.
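
A rough sketch of that policy, written in Python rather than in 
osis2mod's own C++, might look like the following. The names 
ALLOWED_NAMED and convert_entities are hypothetical, and the regular 
expression is an assumption about what counts as an entity reference; 
none of it is taken from the actual tool.

```python
import re

# The four named entities this message treats as acceptable in OSIS input.
ALLOWED_NAMED = {"amp", "lt", "gt", "quot"}

# Matches &#123; (decimal), &#x1F; (hex) and &name; (named) references.
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def convert_entities(text):
    """Convert numeric entities to characters and collect disallowed names.

    Returns (converted_text, sorted list of disallowed entity names).
    """
    rejected = set()

    def replace(match):
        body = match.group(1)
        if body.startswith("#"):
            digits = body[1:]
            code = int(digits[1:], 16) if digits[0] in "xX" else int(digits)
            if code > 0x10FFFF:
                return match.group(0)  # not a Unicode code point; leave as-is
            ch = chr(code)
            # Keep markup-significant characters escaped so the output
            # remains well-formed.
            return match.group(0) if ch in '&<>"' else ch
        if body not in ALLOWED_NAMED:
            rejected.add(body)  # report it, but do not guess a replacement
        return match.group(0)

    return ENTITY.sub(replace, text), sorted(rejected)
```

A caller could print the returned names as warnings and suggest 
validating the document against the OSIS schema before continuing.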

Regarding module writers having trouble running tools, we've talked 
about having a web service at CrossWire.org that would provide the 
appropriate validation, conversion, creation, and so on of an OSIS text. 
We've just not had a volunteer step up to the task.

> We also had trouble with non-canonical Unicode sequences and I think
> osis2mod was corrected.
> Named entities such as nbsp come from HTML and should not be used in OSIS, as
> they are not declared in osisCore.2.1.1.xsd; using them results in an invalid
> document. BUT, as we do not use an XML parser, we can use the HTML DTD[1] to
> resolve them and be friendlier to OSIS writers.

The problem with using entities that are not allowed in OSIS is that one 
cannot validate the document against the OSIS schema. And because OSIS is 
not HTML, one cannot validate it against the HTML DTD either.

For osis2mod to handle character entities other than the four 
mentioned above means that it cannot expect valid OSIS.

> [1] See these URLs; from them a Perl program can produce a .cc or .h file.
> 	http://www.w3.org/TR/html4/HTMLlat1.ent
> 	http://www.w3.org/TR/html4/HTMLsymbol.ent
> 	http://www.w3.org/TR/html4/HTMLspecial.ent

The code I provided handles many more than just these character entities.

> (Sorry if my message looks rude; I'm not a native English speaker)

I didn't take your response as rude. I appreciate your input. I think 
our goals are the same: to produce the highest quality modules 
while minimizing the effort to do so.
All for God's glory.

In Him,
