[sword-devel] PEG parser for USFM

Ryan Hiebert ryan at ryanhiebert.com
Sat Jan 2 15:50:05 MST 2016


> On Jan 2, 2016, at 2:54 PM, David Haslam <dfhmch at googlemail.com> wrote:
> 
> Please visit http://paratext.org/about/usfm
> [snip]

David, thanks for your assistance. Indeed, I've already become fairly familiar with the USFM 2.4 spec, and my attempts to implement that specification as a PEG grammar are what prompted my questions. I am not, however, familiar with actually authoring USFM files. I also do not presently have access to a copy of Paratext to experiment with, and judging from the registration form, I may not be able to get access to it.

They may seem like silly questions, but I cannot find any specific evidence in the spec to settle them one way or the other.

For instance, in the USFM texts that I've seen, there have been _no_ lines, apart from blank lines, that do not begin with a marker of some kind. Is it the case that a line will _always_ start with a marker? The spec is not clear.
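A minimal sketch of how that observation could be checked against sample files, in Python. The function name `lines_without_marker` is hypothetical, and the check assumes only that a marker always starts with a backslash; it is a quick survey tool, not a statement of what the spec allows.

```python
def lines_without_marker(text):
    """Return (line_number, line) for each non-blank line that does not
    start with a backslash. Assumption: every marker begins with '\\'."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        # Skip blank lines; flag lines whose first non-space char is not '\'.
        if line.strip() and not line.lstrip().startswith("\\")
    ]


# Hypothetical sample: line 4 carries text with no leading marker.
sample = "\\c 1\n\\p\n\\v 1 In the beginning\ncontinued text\n\n\\v 2 More"
print(lines_without_marker(sample))
```

An empty result over a large set of real files would support the "always starts with a marker" reading, though it would not prove that the spec requires it.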

Typically, I'd assume that markers were intended to be an _addition_ to the plain text, but the examples I've seen suggest that empty lines likely have no semantic significance, which argues against that assumption.

The definition of a marker, the only formal definition I can find for it, is that it goes from a '\' (backslash) to the next ' ' (space). Unfortunately, this is not sufficient, for two reasons. The first is that a marker may be on its own line, with a newline immediately following it rather than the space the definition requires. The second is that more parsing than that is needed to identify a specific marker, as each marker has its own requirements for the text that may follow it, and some markers must be used together (specifically, those with matching ending markers).
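To make the first problem concrete, here is a rough Python sketch of a looser marker pattern. It is my own assumption rather than anything the spec states: a marker is taken to be a backslash, then a run of characters that are not whitespace, a backslash, or '*', then an optional '*' for an ending marker. The names `MARKER` and `find_markers` are hypothetical.

```python
import re

# Assumed pattern (not from the spec): backslash, marker name, optional '*'.
# Stopping at any whitespace lets a marker be followed by a newline
# as well as by a space.
MARKER = re.compile(r"\\([^\s\\*]+)(\*?)")


def find_markers(text):
    """Return (name, is_ending_marker) pairs for each marker in text."""
    return [(m.group(1), m.group(2) == "*") for m in MARKER.finditer(text)]


# '\p' sits on its own line with no trailing space; '\nd ... \nd*' is a pair.
sample = "\\p\n\\v 1 Thus says the \\nd Lord\\nd* to them"
print(find_markers(sample))
```

This only locates markers. It does nothing about the second problem, since it knows nothing of what each marker expects to follow it or which markers must be paired.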

I hope that I've convinced you that I am doing the required work to understand USFM, and that my questions come at an appropriate point, so that I am at least attempting not to waste your time. They are targeted at specific situations that I think will give me the best insight into how I should be looking at USFM markup.


1. Is text allowed to be on a line _without_ a marker starting the line?
2. Are blank lines semantically meaningful? That is, if all the blank lines are removed, does the file mean _exactly_ the same thing?
3. Are the non-text markers (ones that don't have the ending form, \usfm*) required at the beginning of all meaningful lines?
4. Is only one non-text marker allowed per line?
5. Must a non-text marker be only at the beginning of a line?

