[osis-core] morph regex error

Chris Little osis-core@bibletechnologieswg.org
Mon, 08 Dec 2003 01:22:04 -0600


As the person who actually requested this attribute, and the one who 
implemented it in 2 or 3 Bibles and 2 morphology tag indices back in 
OSIS 1.0....

My recollection leading up to 2.0 was that we wanted to limit the format 
to the present regex.  (That does not deny that there may have been 
further conversations on the subject to which I was not privy or that I 
do not recollect.)

It's true that morphological tagging schemes do use characters that would 
violate the regex format's requirements--space and hyphen specifically. 
However, I'm unaware of any system in which it should actually matter 
which character stands in for them, if they get transcoded.  In 
every system that I know, space and hyphen simply represent dividers and 
place holders.  They never hold any actual content--they have empty 
semantics.  So if they all get encoded to underscores, then decoded as 
hyphens, that should be fine.  (Indeed, in point of fact, there are 
systems--Friberg comes to mind, but I might be wrong--that are rendered 
with spaces in some instances and hyphens in others, depending on the 
publisher & format.)
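
A minimal sketch of that round trip, in Python.  The helper names are 
hypothetical and not part of OSIS or any OSIS tool, and it assumes that 
space and hyphen are pure dividers in the scheme being transcoded:

```python
# Hypothetical helpers, not part of OSIS or any OSIS tool.
# Assumption: space and hyphen carry no content of their own in the scheme.

def encode_morph(tag: str) -> str:
    """Replace the dividers the regex rejects (hyphen, space) with '_'."""
    return tag.replace("-", "_").replace(" ", "_")


def decode_morph(value: str, divider: str = "-") -> str:
    """Render an encoded value for display, using one divider throughout."""
    return value.replace("_", divider)


encoded = encode_morph("V-PAM-2P")
print(encoded)
print(decode_morph(encoded))
```

The mapping is deliberately lossy: a space and a hyphen both become an 
underscore, so it only round-trips cleanly for schemes where the two are 
interchangeable and where underscore is not itself part of a tag.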

My feeling is that it's actually more beneficial to match the osisID 
format in order to allow for linking.  But more importantly, I thought 
we disallowed spaces in attributes unless they divide values in a list.
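
For anyone who wants to check values before running a validator, here is a 
rough Python sketch of the pattern quoted in the MSV error further down 
this thread.  It is an approximation: Python's standard re module has no 
\p{L} or \p{N}, so \w (Unicode letters, digits and underscore) stands in 
for them, and the function name is hypothetical:

```python
import re

# Approximation of the schema pattern quoted in the MSV error.
# Assumption: \w stands in for (\p{L}|\p{N}|_), which the standard
# `re` module cannot express directly.
MORPH_PATTERN = re.compile(r"\w+(\.\w)*:\w+(\.\w+)*")


def is_valid_morph(value: str) -> bool:
    """Return True if the whole value fits the prefix:identifier shape."""
    # XML Schema patterns must match the entire value, hence fullmatch.
    return MORPH_PATTERN.fullmatch(value) is not None


for candidate in ("robinsons:V-PAM-2P", "robinsons:V_PAM_2P"):
    print(candidate, is_valid_morph(candidate))
```

The hyphenated value is rejected and the underscore form is accepted, 
which is the behaviour Troy's error message reports.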

So, taking a tag like, oh, say "N-NSF", I would just encode it as 
"N_NSF" (and did so 1613 times in tr.xml).

--Chris

Troy A. Griffitts wrote:

> Patrick,
>     This is a serious restriction/change.  I specifically remember 
> discussing this with you and we agreed that these tags should NOT be 
> restricted to osisID-like syntax.
> 
>     Serious reasons:
> 
>     VERY REAL SCHEMES (probably the only ones that have ever been marked 
> in OSIS) USE OFFENDING CHARACTERS.
> 
>     We have defined no escape character.
> 
>     Without an escape character EVERY SOFTWARE needs to magically KNOW 
> the scheme used to recode these schemes, instead of just mindlessly 
> displaying them to the scholar (which is what should be allowed).  This 
> is unreasonable.
> 
>     I have texts that I need to release with this morphological scheme 
> NOW, not when 3.0 is released.
> 
>     This is NOT a change that should have been applied without 
> everyone's consent.
> 
> 
>     Not to be a jerk, but being the one that asked for this attribute, 
> and being the only one using this attribute that I know of, I'm a little 
> ticked that it was changed.
> 
> 
>     -Troy.
> 
> 
> 
> Patrick Durusau wrote:
> 
>> Troy,
>>
>> I think the regex is correct, no hyphens are allowed. This does not 
>> mean that you should use a range in any of these, although that is 
>> possible. It does allow these to be used as osisRefs so that they can 
>> refer to other sources of information.
>>
>> Perhaps we should revisit at the January OSIS meeting but I don't 
>> think we will reach a different conclusion.
>>
>> Hope you are having a great day!
>>
>> Patrick
>>
>> Troy A. Griffitts wrote:
>>
>>> :)
>>>
>>> Unless I'm going senile-- which I've been suspecting for some time 
>>> now-- I believe that the last discussion on this subject, before 
>>> release of 2.0, concluded that lemma, xlit, gloss, and morph WOULD 
>>> NOT be restricted by osisRef syntax.  We would make a separate 
>>> complexType for them, which basically would allow: prefix:any_string
>>>
>>> I think I wanted to allow spaces (especially for gloss), and Patrick 
>>> found real-world occurrences of other systems that used prohibited 
>>> characters, as well.
>>>
>>> So the conclusion was either:
>>>
>>> prefix:any_string
>>>
>>> or
>>>
>>> prefix:any string
>>>
>>> I think Steve may have made some push for replacing the 'space' but 
>>> don't remember the conclusion on that one.
>>>
>>> But regardless, there are no spaces in my offending line that I 
>>> quoted earlier, and yet I still get an error.
>>>
>>> If I have to remove the cobwebs to defend this again, I will try, but 
>>> I think it's just an oversight in the .xsd.
>>>
>>>     -Troy.
>>>
>>>
>>>
>>>
>>> Chris Little wrote:
>>>
>>>> Okay, okay.  No need to shout.  Don't kill the messenger.  Etc. :)
>>>>
>>>> The problem with changing the format is that we can no longer use 
>>>> morph, lemma, etc. values as osisRefs.  As it stands, any of these 
>>>> attributes could double as an osisRef/osisID.  So your lexicon, 
>>>> organized by lemma, could have divisions with osisIDs that are the 
>>>> same as their lemma values.  Likewise, if you organize the Robinson 
>>>> morphology scheme as a sort of lexicon, you can look up entries and 
>>>> tag them with osisIDs that are identical to your morph value.
>>>>
>>>> --Chris
>>>>
>>>> Troy A. Griffitts wrote:
>>>>
>>>>> NO!
>>>>>
>>>>>
>>>>> Chris Little wrote:
>>>>>
>>>>>> Troy A. Griffitts wrote:
>>>>>>
>>>>>>> Hey guys.  It seems we may have messed up the regex on the morph 
>>>>>>> attribute of <w>.
>>>>>>>
>>>>>>> Here's my line:
>>>>>>>
>>>>>>> <w xml:lang="grc" lemma="strongs:15" morph="robinsons:V-PAM-2P" 
>>>>>>> xlit="la:agaqopoieite">GREEK UTF8 TEXT HERE</w>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Here's the MSV error output:
>>>>>>>
>>>>>>> Error at line:279, column:117 of 
>>>>>>> file:///space/home/scribe/msv/./lexcounts
>>>>>>>   attribute "morph" has a bad value: the value does not match the 
>>>>>>> regular expression 
>>>>>>> "((((\p{L}|\p{N}|_)+)(\.(\p{L}|\p{N}|_))*:)((((\p{L})|(\p{N})|_)+)(((\.(\p{L}|\p{N}|_)+)*))?))". 
>>>>>>
>>>>>> The value you give has never been valid.  Hyphens have never been 
>>>>>> allowed in morph or lemma attributes (nor have spaces and various 
>>>>>> other characters).  I think the decision we made before releasing 
>>>>>> 2.0 was to force folks to transcode these as '_'.
>>>>>>
>>>>>> Does that work for you?
>>>>>>
>>>>>> --Chris