[osis-core] Schema: type on language

Chris Little osis-core@bibletechnologieswg.org
Sun, 19 Oct 2003 09:58:32 -0700 (MST)


Todd,

On Sun, 19 Oct 2003, Todd Tillinghast wrote:

> Chris,
> 
> Are you saying that you will not be able to sort out which of the many
> forms allowed in IETF/xml:lang has been stated and that you would like
> to use <language type="...">language code</language> to help sort out
> which case has been encoded, but that the values for <language> and
> xml:lang would be identical?  
> 
> That seems reasonable.  

Almost.  I'm saying it would be reasonable for an organization like SIL to 
encode:
<language use="base" type="IETF">sq</language>
<language use="base" type="SIL">ALS</language>

That is, they should be able to identify the language according to a 
common form, to be used by all documents & organizations, identical to the 
form used for xml:lang (the IETF form).  But they should also be able to 
use a form of their own for in-house categorization.

Using values like "x-ISO-639-1-sq" might be valid, but to be of any use, 
it would have to be parsed as a string and cut into chunks.  I say, why 
not just use type and be more explicit?
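
To make the contrast concrete, here is a minimal Python sketch.  It is 
only an illustration: the function names, the prefix list, and the bare 
<work> wrapper (no namespace) are assumptions for this example, not part 
of the schema.  The first function has to cut an "x-" string into chunks 
against a list of known prefixes; the second just reads the type 
attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical prefix list, taken from the "x-" examples in this thread.
KNOWN_PREFIXES = ("x-ISO-639-1-", "x-ISO-639-2-T-", "x-ISO-639-2-B-", "x-SIL-")


def split_private_tag(value):
    """Cut an 'x-' style value into (scheme, code); (None, value) if no prefix matches."""
    for prefix in KNOWN_PREFIXES:
        if value.startswith(prefix):
            # Drop the leading "x-" and the trailing "-" to get the scheme name.
            return prefix[2:-1], value[len(prefix):]
    return None, value


def typed_languages(work_xml):
    """Read (type, code) pairs directly from <language type="..."> elements."""
    root = ET.fromstring(work_xml)
    return [(el.get("type"), el.text) for el in root.iter("language")]


work = (
    '<work>'
    '<language use="base" type="IETF">sq</language>'
    '<language use="base" type="SIL">ALS</language>'
    '</work>'
)
print(split_private_tag("x-ISO-639-1-sq"))
print(typed_languages(work))
```

With the type attribute, nothing needs to know the prefix conventions in 
advance; the string approach breaks as soon as someone uses a prefix that 
is not in the list.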

> It also seems unfortunate that the XML/ISO standards bodies have made it
> difficult for it to be obvious which standard is being used.  (I am sure
> with an enumeration of all possible values you can derive which standard
> a value comes from.)

The only real ambiguity comes with discerning between ISO 639-2/T and /B.  
Besides that, 2-letter elements are ISO 639-1, 3-letter are one of the -2 
standards, those starting with i- are IANA, and everything starting with 
x- is officially unknown.
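
As a rough Python sketch of that rule of thumb (the function name and the 
returned labels are made up for illustration, and it only looks at the 
shape of the code, not at any registry):

```python
def guess_code_source(code):
    """Guess which standard a language code comes from, by shape alone."""
    lowered = code.lower()
    if lowered.startswith("x-"):
        return "private use (officially unknown)"
    if lowered.startswith("i-"):
        return "IANA"
    # Only the primary subtag matters, e.g. "en" in "en-US".
    primary = lowered.split("-")[0]
    if len(primary) == 2 and primary.isalpha():
        return "ISO 639-1"
    if len(primary) == 3 and primary.isalpha():
        # /T and /B cannot be told apart from the shape of the code.
        return "ISO 639-2 (T or B)"
    return "unrecognized"


for sample in ("sq", "sqi", "alb", "i-klingon", "x-SIL-ALN"):
    print(sample, "->", guess_code_source(sample))
```

Note that 'sqi' and 'alb' get the same answer, which is exactly the /T 
versus /B ambiguity: telling them apart takes a lookup table, not a 
pattern.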

> I am not sure why you want to add "French", "English", and "native"?
> This would seem to further confuse the situation.  Maybe I don't
> understand how you would use them.

My thought was to add them as a convenience to those who might wish to use 
them.  Rather than forcing lookups from a table that maps codes to language 
names, the name would be held in the document.  The reason for choosing 
English & French is that they are the international languages used by ISO 
& SIL for their code databases.

If you think it would be better to leave this out, I'm okay with that.

> Relative to people using codes like "Austronesian (Other)", I think the
> documentation should recommend a "concrete" language for xml:lang and
> that a <language> entry for "Austronesian (Other)" would be fine to use
> within <work> in addition to the "concrete" language code.

I'm in agreement here.  I think the value for xml:lang should match that 
chosen for the IETF type, and should identify the most specific language 
code that makes the encoder happy.

Going back to Albanian... Ethnologue lists 4 dialects of Albanian, all of
which would be identified with ISO 639-1 code 'sq', but different SIL
codes.  Dialects of a single language can often have a common written
form.  If that is the case with Albanian and I have a Bible in the
common written form, I might (if I were SIL and wanted to identify SIL 
codes in my work) encode:

<osisText xml:lang="sq">
...
<language type="IETF">sq</language>
<language type="SIL">AAH</language>
<language type="SIL">AAE</language>
<language type="SIL">ALS</language>
<language type="SIL">ALN</language>

However, if they were not all the same written language and I had a Bible 
written specifically in Tosk Albanian, I would encode:

<osisText xml:lang="x-SIL-ALN">
...
<language type="IETF">x-SIL-ALN</language>
<language type="ISO-639-1">sq</language>

Does that seem sensible?
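
A consistency check along those lines could be sketched in Python as 
follows.  This assumes a simplified fragment: no OSIS namespace, and the 
<language> elements sitting somewhere under <osisText>.  The function 
name is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def xml_lang_matches_ietf(osis_xml):
    """True if xml:lang on the root equals some <language type="IETF"> value."""
    root = ET.fromstring(osis_xml)
    ietf_values = [
        el.text for el in root.iter("language") if el.get("type") == "IETF"
    ]
    return root.get(XML_LANG) in ietf_values


specific = (
    '<osisText xml:lang="x-SIL-ALN"><work>'
    '<language type="IETF">x-SIL-ALN</language>'
    '<language type="ISO-639-1">sq</language>'
    '</work></osisText>'
)
mismatch = (
    '<osisText xml:lang="sq"><work>'
    '<language type="IETF">x-SIL-ALN</language>'
    '</work></osisText>'
)
print(xml_lang_matches_ietf(specific))
print(xml_lang_matches_ietf(mismatch))
```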

--Chris


> 
> Todd 
> 
> > -----Original Message-----
> > From: osis-core-admin@bibletechnologieswg.org 
> > [mailto:osis-core-admin@bibletechnologieswg.org] On Behalf Of 
> > Chris Little
> > Sent: Sunday, October 19, 2003 2:25 AM
> > To: osis-core@bibletechnologieswg.org
> > Subject: RE: [osis-core] Schema: type on language
> > 
> > 
> > 
> > Todd,
> > 
> > For one, it's questionable whether we can really say any language can be 
> > unambiguously identified.  But let's suppose we really know what English 
> > is and we really know that 'en' identifies it.  ISO 639 does a better job 
> > of unambiguously identifying some languages than it does for others.  
> > There are a bunch of codes that describe groups of languages, such as 
> > "Native American Indian" and "Austronesian (Other)".
> > 
> > So, it's not quite true that Javanese has no ISO code; it's just a very, 
> > very ambiguous code shared with hundreds of other languages.  (The code 
> > would be 'map' -- "Austronesian (Other)".)
> > 
> > I think it is valuable to keep type="...", since some organizations use 
> > those codes themselves for various sorting purposes (e.g. the Library of 
> > Congress uses ISO 639-2/B and SIL uses Ethnologue codes).  If they need to 
> > use such data, I think we should provide a place to hold it.
> > 
> > But for interoperability, IETF/xml:lang is probably best.
> > 
> > What are your thoughts on also adding "English", "French", & "native" to 
> > the types enumeration?  Is that unnecessary/inappropriate?
> > 
> > 
> > --Chris
> > 
> > 
> > On Fri, 17 Oct 2003, Todd Tillinghast wrote:
> > 
> > > Chris,
> > > 
> > > If there is a way to unambiguously express ALL of the various language 
> > > values using xml:lang in an IETF-compliant string, then it would seem to 
> > > make sense to use that same structure for the value of <language> and 
> > > for xml:lang AND not have a type="..." set of enumerated types.
> > > 
> > > Ex:
> > > Javanese, for which there is no ISO code:
> > > <osisText xml:lang="x-SIL-JVN">
> > > and 
> > > <work>
> > >    <language>x-SIL-JVN</language>
> > > </work>
> > > 
> > > Albanian:
> > > <osisText xml:lang="sq">
> > > and
> > > <work>
> > >    <language>sq</language>
> > >    <language>x-ISO-639-1-sq</language>
> > >    <language>x-ISO-639-2-T-sqi</language>
> > >    <language>x-ISO-639-2-B-alb</language>
> > >    <language>x-SIL-ALS</language>
> > > </work>
> > > 
> > > This would keep the xml:lang and <language> values consistent.  It 
> > > would seem that we will have to enumerate the "x-" alternatives for 
> > > xml:lang in the documentation so we might as well use the same 
> > > structure both places.
> > > 
> > > I believe that "x-" is allowed in the w3c's xml.xsd schema so the 
> > > above options should work.  (Naturally if there is already an 
> > > established syntax for ISO values within xml:lang we should use it 
> > > rather than my x- values above.)
> > 
> > 
> > 
> > 
> > _______________________________________________
> > osis-core mailing list
> > osis-core@bibletechnologieswg.org 
> > http://www.bibletechnologieswg.org/mailman/listinfo/osis-core
> > 