[osis-users] Differences in segment list between osisWork and osisPassage's identifier and subidentifier

Weston Ruter westonruter at gmail.com
Mon Jun 21 09:19:34 MST 2010


There are two period-delimited lists in the OSIS spec: osisWork and the
identifier and subidentifier components of what I call "osisPassage". These
are respectively:

Bible.en.KJV
John.3.16!a.1

The regular expression for osisWork is:
((\p{L}|\p{N}|_)+)((\.(\p{L}|\p{N}|_)+)*)?

Whereas the regular expression for the segment lists in osisPassage is:
((\p{L}|\p{N}|_|(\\[^\s]))+)((\.(\p{L}|\p{N}|_|(\\[^\s]))+)*)?

Namely, the osisPassage segments are allowed to have escaped characters
whereas the osisWork segments are not. Is this intentional? Why would one
allow escapes but the other not?
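
To make the difference concrete, here is a minimal sketch in Python. It rests on assumptions: \w stands in for (\p{L}|\p{N}|_), since Python's re module has no \p{...} classes, the helper names are hypothetical, and the sample string Some\-Book.1 is made up for illustration rather than taken from the spec.

```python
import re

# Approximations of the two schema patterns. \w is an assumption standing in
# for (\p{L}|\p{N}|_), which Python's re module cannot express directly.
OSIS_WORK = re.compile(r"\w+(?:\.\w+)*")
OSIS_PASSAGE_SEGMENTS = re.compile(r"(?:\w|\\\S)+(?:\.(?:\w|\\\S)+)*")


def is_work_segment_list(text):
    """True if the whole string is a valid osisWork-style segment list."""
    return OSIS_WORK.fullmatch(text) is not None


def is_passage_segment_list(text):
    """True if the whole string is a valid osisPassage-style segment list."""
    return OSIS_PASSAGE_SEGMENTS.fullmatch(text) is not None


# The last sample contains a backslash escape (hypothetical input).
for sample in ["Bible.en.KJV", "John.3.16", r"Some\-Book.1"]:
    print(sample, is_work_segment_list(sample), is_passage_segment_list(sample))
```

Under these assumptions the first two samples pass both checks, while the escaped sample is accepted only by the osisPassage pattern.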

BTW, I have simplified the osisWork regular expression to:

import re

# Python 3 spelling: the ur"" prefix is Python 2 only.
segment_regexp = re.compile(r"""
    (?P<segment>    \w+    )
    (?P<delimiter> \. | $ )?
""", re.VERBOSE | re.UNICODE)

And the osisPassage identifier/subidentifier segments to:

# A segment is a run of word characters and backslash escapes.
segment_regex = re.compile(r"""
    (?P<segment>   (?: \w | \\\S )+ )
    (?P<delimiter>     \. | $     )?
""", re.VERBOSE | re.UNICODE)

These patterns are matched repeatedly until the end of the string is
reached. Python's Unicode-aware \w character class may not match exactly
the same characters as the (\p{L}|\p{N}|_) class used in the XML Schema
regular expressions, but the two should be close enough to be practically
equivalent.
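
The repeated matching could look roughly like the sketch below. This is only one way to drive the pattern, not necessarily how my own code does it: split_segments is a hypothetical helper, the error handling is an assumption, and the pattern is written with a plain r"" prefix so that it runs on Python 3.

```python
import re

# Same pattern as the osisPassage segment regex, in Python 3 syntax.
segment_regex = re.compile(r"""
    (?P<segment>   (?: \w | \\\S )+ )
    (?P<delimiter>     \. | $     )?
""", re.VERBOSE)


def split_segments(text):
    """Split a period-delimited segment list, keeping escapes intact.

    Raises ValueError if the input is empty or malformed.
    """
    segments = []
    pos = 0
    while pos < len(text):
        match = segment_regex.match(text, pos)
        # No match, or a segment followed by neither "." nor end of string.
        if match is None or match.group("delimiter") is None:
            raise ValueError("invalid segment at offset %d in %r" % (pos, text))
        segments.append(match.group("segment"))
        pos = match.end()
        # A "." must always be followed by another segment.
        if match.group("delimiter") == "." and pos == len(text):
            raise ValueError("trailing delimiter in %r" % text)
    if not segments:
        raise ValueError("empty segment list")
    return segments


print(split_segments("John.3.16"))    # ['John', '3', '16']
print(split_segments(r"Foo\.Bar.1"))  # the escaped period stays inside the first segment
```

Checking that the delimiter group actually took part in the match is what distinguishes "segment followed by end of string" (empty delimiter) from "segment followed by an unexpected character" (no delimiter at all).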

So is there a reason why osisWork and osisPassage allow different
characters in their segments?

Weston