[osis-core] OSIS_0105:19 Regexs

Sat, 06 Apr 2002 13:55:33 -0500

Guys,

Just to be consistent,

The regex stuff from yesterday, nothing new:

1. Regexs:

Generally see: http://www.w3.org/TR/xmlschema-2/#regexs

ReferenceType

Now reads: ([^.]+)((.[^.]+){0,})?

Note that "^" begins a negative character group.

Note that the "." character in XML Schema is the equivalent of [^\n\r]: 
any character except newline or carriage return.

So, [^.] means only newline (excludes all other characters)

Or more formally from the standard:

[Definition:] A negative character group is a positive character group 
<http://www.w3.org/TR/xmlschema-2/#dt-poschargroup> preceded by the ^ 
character. For all positive character groups P, ^P is a valid negative 
character group, and C(^P) contains all XML characters that are not in 
C(P).

Negative Character Group
[15]   negCharGroup   ::=   '^' posCharGroup 
<http://www.w3.org/TR/xmlschema-2/#nt-posCharGroup>

I assume the intent of the expression is:

1. Any legal namestart character, followed by,
2. Any legal name character, followed by,
3. literal "." character, followed by
4. one or more groups of legal name characters separated by a literal "."

If that is the case, I would suggest that we re-write ReferenceType to 
read:

([\i]([\c])*\.((\c)*\.)?

Note that \i = any legal initial name character, \c = any legal name 
character, \. = literal "." or full stop
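For anyone who wants to try the stated intent outside a schema processor, here is a minimal sketch in Python. It is an approximation under assumptions: Python's re module has no \i or \c escapes, so NAME_START and NAME_CHAR below are hypothetical ASCII-only stand-ins, and NAME_CHAR deliberately leaves out "." so that the full stop acts only as a separator. It is not the expression proposed above, just one reading of the four numbered steps.

```python
import re

# ASCII-only stand-ins (assumptions): Python's re has no \i or \c escapes.
NAME_START = r"[A-Za-z_]"      # rough analogue of \i
NAME_CHAR = r"[A-Za-z0-9_-]"   # rough analogue of \c, minus "." so dots only separate

# A name-start character, any name characters, then one or more
# groups of name characters, each introduced by a literal ".".
REFERENCE = re.compile(rf"{NAME_START}{NAME_CHAR}*(\.{NAME_CHAR}+)+")

def is_reference(value: str) -> bool:
    """True if the whole value fits the sketched ReferenceType shape."""
    return REFERENCE.fullmatch(value) is not None
```

With these stand-ins, is_reference("Gen.1.1") holds, while "Gen" (no full stop) and "Gen..1" (empty group) are rejected.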

Additionally, since we have compScriptureReferenceType (I treat that 
regex below), I am not sure what ReferenceType gets us in terms of 
validation. The structure of the references, perhaps? I would welcome 
some discussion on this and on WorkType (next).

(BTW, schema regexes are implicitly anchored: a pattern must match the 
whole value, so there is no need for explicit anchors.)
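The anchoring point matters when testing these patterns in a general-purpose regex engine, where a search usually succeeds on a match anywhere in the string. A minimal sketch in Python (my choice of language, not anything the schema uses), where re.fullmatch is the closest analogue to how a schema pattern is applied; DIGITS is a hypothetical pattern for illustration only:

```python
import re

# Hypothetical pattern for illustration only: one or more ASCII digits.
DIGITS = r"[0-9]+"

def schema_style_match(pattern: str, value: str) -> bool:
    """Emulate schema pattern behaviour: the whole value must match."""
    return re.fullmatch(pattern, value) is not None

def search_style_match(pattern: str, value: str) -> bool:
    """Unanchored search: a match anywhere in the value is enough."""
    return re.search(pattern, value) is not None
```

Here "12a" passes search_style_match but fails schema_style_match, which is why explicit ^ and $ are unnecessary in the schema but needed (or replaced by fullmatch) elsewhere.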

WorkType:

Now reads: ([^.]+(.[^.]+)

Same problems as above with "^" and invoking of literal full stop.
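To make the full stop problem concrete, here is a small Python sketch. The two patterns are hypothetical and far simpler than the schema's; they differ only in whether the "." is escaped. Python treats an unescaped "." outside a character class as a wildcard much as XML Schema does, so the behaviour should carry over, but treat this as an illustration and not a test of the schema itself.

```python
import re

# Hypothetical patterns, identical except for the escaping of ".".
UNESCAPED = r"[a-z]+.[0-9]+"   # "." is a wildcard: any single character
ESCAPED = r"[a-z]+\.[0-9]+"    # "\." is a literal full stop

def matches(pattern: str, value: str) -> bool:
    """Whole-value match, as a schema pattern would be applied."""
    return re.fullmatch(pattern, value) is not None
```

Both patterns accept "gen.1", but only UNESCAPED also accepts "genX1", so the unescaped form does not actually require a full stop between the parts.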

Is the intent of this expression the same as ReferenceType?

In other words, to match:

1. Any legal namestart character, followed by,
2. Any legal name character, followed by,
3. literal "." character, followed by
4. one or more groups of legal name characters separated by a literal "."

If so, why would I want both of them? For that matter, the more I think 
about it, I am not sure what function either one would serve, at least 
in light of our not declaring a set of references to other works.

Suggestion: Why not settle on an outside reference pointer that 
subclasses xs:string, the way we have for enumerated values on 
attributes? You can at this point declare whatever other pointers you 
like, but prepend "x-" to them. That would allow us later (probably 
by the Fall release of the translator and publisher modules) to declare 
references like compScriptureReferenceType that provide validation of at 
least part of the reference.

compScriptureReferenceType:

Now reads (in part) ((...All Book Names...))((.[^.]+){0,}))?

Same problems as above with "^" and invoking of literal full stop.

I assume the intent here is:

1. Book Name, followed by
2. literal "." character, followed by
3. any digit or letter (one or more), optionally followed by
4. literal "." character, followed by
5. any digit or letter (one or more)

(Question: do we need letters for some Bible references?)

If that is the case, would the following work?

((...All Book Names...))\.[A-Za-z0-9]+(\.[A-Za-z0-9]+)?

Note that this expression requires a book name plus a chapter. Could 
someone want to refer to just Matthew?
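To weigh the "just Matthew" question, here is a hedged Python sketch of both options. BOOKS is a hypothetical four-name subset standing in for the full book list, which I have not reproduced, and each segment uses "+" to follow the "one or more" wording of the intent. The chapter_required flag is my own device for comparing the two behaviours, not something in the schema.

```python
import re

# Hypothetical subset of book names; the real list is not reproduced here.
BOOKS = ("Gen", "Exod", "Matt", "John")

def build_pattern(chapter_required: bool = True) -> "re.Pattern":
    """Build a sketch of compScriptureReferenceType as a Python regex."""
    books = "|".join(re.escape(b) for b in BOOKS)
    segment = r"\.[A-Za-z0-9]+"   # literal "." plus one or more letters/digits
    if chapter_required:
        # book.chapter, optionally followed by .verse
        body = rf"({books}){segment}({segment})?"
    else:
        # bare book name allowed, with up to two further segments
        body = rf"({books})({segment}){{0,2}}"
    return re.compile(body)
```

With chapter_required=True, "John.3" and "John.1.1" match but "Matt" does not; with chapter_required=False, "Matt" matches as well.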

Proposal:

The Argument (don't you just love Milton!):

For elements themselves, we want to allow them to have IDs to which 
other things can point, either by IDREF (milestones) or by linking from 
within or from without. This is the "who am i" function of an 
identifier. If it functions as an ID, it is restricted by the XML Name 
requirements.

The other thing we want (I think) is the ability of notes, milestones, 
and other objects to point to the elements (usually in a 
containment-like relationship) that they refer to or contain. This is 
an "i start at" and "i end at" type function. Obviously it must use the 
IDs found on other elements.

We can partially validate scripture references since we are declaring a 
known set of names and format for the references to the materials 
referenced by those names.

The Suggestion:

Separate the notion of Bible references from other references more 
generally. For non-Bible references, simply defer by declaring non-Bible 
references to be "x-" and to be treated at some later point with a 
validation mechanism like we have for Bible references.
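The split between validated Bible references and deferred "x-" references could be sketched like this in Python. Everything named here is an assumption for illustration: SCRIPTURE is a toy stand-in for compScriptureReferenceType with a hypothetical three-book list, and classify_reference is not part of any schema or tool.

```python
import re

# Toy stand-in for compScriptureReferenceType (hypothetical book list).
SCRIPTURE = re.compile(r"(Gen|Matt|John)\.[A-Za-z0-9]+(\.[A-Za-z0-9]+)?")

def classify_reference(value: str) -> str:
    """Return 'extension', 'scripture', or 'invalid' for a reference."""
    if value.startswith("x-") and len(value) > 2:
        return "extension"    # non-Bible reference: validation deferred
    if SCRIPTURE.fullmatch(value):
        return "scripture"    # Bible reference: validated now
    return "invalid"
```

Under this sketch "John.1.1" is validated as scripture, "x-josephus.ant.1" is accepted without further checks, and an unprefixed non-Bible reference such as "Josephus.1" is rejected.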

For verse milestones, restrict the IDs to the compScriptureReferenceType 
so that we get validation for the "who am i" function here.

(Agreeing with Troy here that StartVerse and StartVerse as attribute is 
confusing. Just use ID, datatype ID.)

IDs on books, divs, and paragraphs can use compScriptureReferenceType, 
but it is not required.

For books, divs, paragraphs, notes, etc., the "where I point" function 
(not "who am i") should be IDREF, with names like: startNote = 
"John.1.1", startDiv = "Gen.1.1", endNote = "John.1.2", etc. Note that 
making these IDREFs makes us certain that the IDs appear in the work 
(via the XML validation process) and also enforces the use of 
compScriptureReferenceType in the encoding.
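The check that IDREF gives us can be imitated by hand, which may help when testing sample encodings before the schema is settled. A minimal sketch in Python: the attribute names come from the examples above, while the element names in SAMPLE (text, verse, note) are hypothetical. Python's ElementTree does not validate, so this only mimics the "every pointer has a target" part of what a validating parser would enforce.

```python
import xml.etree.ElementTree as ET

# Attribute names follow the examples above; element names are hypothetical.
POINTER_ATTRS = ("startNote", "endNote", "startDiv")

def dangling_pointers(xml_text: str) -> list:
    """List pointer values that do not match any ID in the document."""
    root = ET.fromstring(xml_text)
    ids = {el.get("ID") for el in root.iter() if el.get("ID") is not None}
    missing = []
    for el in root.iter():
        for attr in POINTER_ATTRS:
            target = el.get(attr)
            if target is not None and target not in ids:
                missing.append(target)
    return missing

SAMPLE = """<text>
  <verse ID="John.1.1"/>
  <verse ID="John.1.2"/>
  <note startNote="John.1.1" endNote="John.1.3"/>
</text>"""
```

For SAMPLE, dangling_pointers reports "John.1.3", since no element carries that ID.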

Patrick

-- 
Patrick Durusau
Director of Research and Development
Society of Biblical Literature
pdurusau@emory.edu