[osis-core] OSIS Candidate.1.1_001 - 5 bad osisWorkType regex

Chris Little osis-core@bibletechnologieswg.org
Sat, 17 Aug 2002 13:59:09 -0700 (MST)


> I think what is desired is:
> ((\p{L}|\p{N}|_)*)(\.(\p{L}|\p{N}|_)*)*

I was a little off on this regex.  I think what we REALLY want is:
((\p{L}|\p{N}|_)+)(\.(\p{L}|\p{N}|_)+)*

(because we want each component to be at least 1 char long; otherwise 
"", ".", "..", & ".a..b" would all be valid)

however,
[\p{L}\p{N}_]+(\.[\p{L}\p{N}_]+)*
is a more compact way of expressing this, and one that most Perl folks 
would probably prefer, if Schema will allow it.
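
To make the difference between the * and + forms concrete, here is a 
minimal Python sketch.  It rests on an assumption: Python's standard re 
module has no \p{L} or \p{N}, so \w (Unicode letters, digits, and 
underscore) stands in for (\p{L}|\p{N}|_).  That is a close approximation, 
not an exact equivalent, and the work names are made up for illustration.

```python
import re

# Assumption: \w approximates (\p{L}|\p{N}|_); stdlib `re` lacks \p{...}.
STAR_FORM = re.compile(r"((\w)*)(\.(\w)*)*")   # the originally quoted regex
PLUS_FORM = re.compile(r"((\w)+)(\.(\w)+)*")   # corrected, alternation style
CLASS_FORM = re.compile(r"[\w]+(\.[\w]+)*")    # corrected, character-class style


def is_valid(pattern, text):
    # XML Schema patterns are implicitly anchored, so match the whole string.
    return pattern.fullmatch(text) is not None


degenerate = ["", ".", "..", ".a..b"]
well_formed = ["Bible", "Bible.KJV", "Bible.en.KJV_1611"]  # hypothetical names

for s in degenerate + well_formed:
    print(repr(s),
          is_valid(STAR_FORM, s),
          is_valid(PLUS_FORM, s),
          is_valid(CLASS_FORM, s))
```

The * form accepts all four degenerate strings, while both + forms reject 
them and agree with each other on the well-formed names.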

I also took a look at osisRefType, which I think should be:
(((\p{L}|\p{N}|_)+)(\.(\p{L}|\p{N}|_)+)*:)?((\p{L}|\p{N}|_)+)(\.(\p{L}|\p{N}|_)+)*(@(cp:(\p{Nd})+|str\[(\p{L}|\p{N})+\]))?(-(((\p{L}|\p{N}|_)+)(\.(\p{L}|\p{N}|_)+)*(@(cp:(\p{Nd})+|str\[(\p{L}|\p{N})+\]))?))?

I had two questions about this.  First, didn't we decide that 
Bible:Gen.1.1@cp[5] was the proper syntax rather than Bible:Gen.1.1@cp:5?  
I kept the latter, since that is what appeared in the last posting.  
Second, shouldn't the string grain allow more than letters and numbers?  
Spaces, punctuation, symbols, etc. come to mind, though I don't know how 
to represent them in a regex.
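
One possible way to open up the string grain, offered only as a sketch and 
not as anything agreed on the list, is a negated character class that 
allows every character except the closing bracket.  The Python below 
compares that hypothetical form with an approximation of the current one 
([^\W_] standing in for (\p{L}|\p{N}), since the standard re module has no 
\p{...} escapes).

```python
import re

# Current rule, approximated: letters and numbers only inside str[...].
STR_GRAIN_CURRENT = re.compile(r"str\[[^\W_]+\]")

# Hypothetical alternative (an assumption, not an agreed syntax):
# any run of one or more characters other than ']'.
STR_GRAIN_OPEN = re.compile(r"str\[[^\]]+\]")


def accepts(pattern, text):
    return pattern.fullmatch(text) is not None


samples = ["str[beginning]", "str[In the beginning,]", "str[]"]
for s in samples:
    print(s, accepts(STR_GRAIN_CURRENT, s), accepts(STR_GRAIN_OPEN, s))
```

The trade-off is that a literal ] could then never appear inside the 
string unless some escaping convention were added.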

The other change I made was to allow:
(work:)ref(@grain)(-ref(@grain))

Parentheses indicate optional elements, so only the first ref is 
mandatory.  At most one work is allowed, and only at the start, which 
prevents ranges across different works (I've no idea how that would 
work).  A grain can be specified on either or both of the references in 
a range.
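
The (work:)ref(@grain)(-ref(@grain)) shape can be sketched by assembling 
the pattern from named pieces.  This is an illustration under assumptions, 
not the Schema pattern itself: \w approximates (\p{L}|\p{N}|_), \d 
approximates \p{Nd}, and [^\W_] approximates (\p{L}|\p{N}), because 
Python's standard re module has no \p{...} escapes.  The names SEGMENTS, 
GRAIN, and is_osis_ref are invented for the example.

```python
import re

# Assumed approximations of the Unicode categories (see note above).
SEGMENTS = r"\w+(?:\.\w+)*"              # dotted work or ref, 1+ chars each
GRAIN = r"@(?:cp:\d+|str\[[^\W_]+\])"    # @cp:N or @str[letters/numbers]

# (work:)ref(@grain)(-ref(@grain))
OSIS_REF = re.compile(
    rf"(?:{SEGMENTS}:)?{SEGMENTS}(?:{GRAIN})?(?:-{SEGMENTS}(?:{GRAIN})?)?"
)


def is_osis_ref(text):
    # Schema patterns are implicitly anchored, so match the whole string.
    return OSIS_REF.fullmatch(text) is not None


for s in ["Gen.1.1",
          "Bible:Gen.1.1@cp:5",
          "Bible:Gen.1.1@cp:5-Gen.1.2@str[God]",
          "Bible:Gen.1.1-Other:Gen.2.1",   # second work: rejected
          "Gen.1.1@cp[5]"]:                # bracket form: rejected
    print(s, is_osis_ref(s))
```

Building the pattern from pieces also makes it easy to swap the grain 
syntax later without retyping the whole expression.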

Accordingly, osisIDType would be:
(((\p{L}|\p{N}|_)+)(\.(\p{L}|\p{N}|_)+)*:)?((\p{L}|\p{N}|_)+)(\.(\p{L}|\p{N}|_)+)*( (((\p{L}|\p{N}|_)+)(\.(\p{L}|\p{N}|_)+)*:)?((\p{L}|\p{N}|_)+)(\.(\p{L}|\p{N}|_)+)*)*

if I understand correctly what it is supposed to represent, namely:
(work:)ref( (work:)ref)*
that is, one or more space-delimited refs, each with an optional work 
but no grain.

The main changes to this were removing the spaces around the |'s, 
requiring refs/works to be at least 1 character long, and changing the \s 
to a literal ' ', since I didn't think we wanted to allow things like \t, 
\n, & \r.
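
The effect of using a literal space instead of \s can be checked with a 
small sketch of the (work:)ref( (work:)ref)* shape.  As before, this 
assumes \w as a stand-in for (\p{L}|\p{N}|_), since Python's standard re 
module has no \p{...} escapes, and the names ONE_ID and is_osis_id are 
invented for the example.

```python
import re

# Assumption: \w approximates (\p{L}|\p{N}|_).
SEGMENTS = r"\w+(?:\.\w+)*"                 # dotted work or ref
ONE_ID = rf"(?:{SEGMENTS}:)?{SEGMENTS}"     # (work:)ref, no grain

# One or more ids separated by exactly one literal space (not \s).
OSIS_ID = re.compile(rf"{ONE_ID}(?: {ONE_ID})*")


def is_osis_id(text):
    return OSIS_ID.fullmatch(text) is not None


for s in ["Gen.1.1",
          "Bible:Gen.1.1 Bible:Gen.1.2 Gen.1.3",
          "Gen.1.1\tGen.1.2",    # tab separator: rejected
          "Gen.1.1  Gen.1.2",    # double space: rejected
          "Gen.1.1@cp:5"]:       # grain: rejected
    print(repr(s), is_osis_id(s))
```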

Let me know if I misunderstood the use/format of any of these types.  And 
if Schema will accept the more compact notation without the pipes, let me 
know, and I'll redo the regexes in that format.

--Chris