[osis-core] OLAC 0.2

Patrick Durusau osis-core@bibletechnologieswg.org
Fri, 23 May 2003 16:28:45 -0400


http://www.language-archives.org/OLAC/olac-0.2.html

--
Patrick Durusau
Director of Research and Development
Society of Biblical Literature
Patrick.Durusau@sbl-site.org
Co-Editor, ISO 13250, Topic Maps -- Reference Model





OLAC 0.2: DC qualified with language codes
|---------+-----------------------------------------------------|
|    Date:| 1 February 2001                                     |
|---------+-----------------------------------------------------|
|  Schema:| http://www.language-archives.org/OLAC/olac-0.2.xsd  |
|---------+-----------------------------------------------------|
| Example:| http://www.language-archives.org/OLAC/olac-0.2.xml  |
|---------+-----------------------------------------------------|




Overview

The agreement in Philadelphia was to use RFC 1766 with its extension
mechanism to permit Ethnologue codes of the form x-sil-AAA, where AAA is
the three-letter Ethnologue code.  If and when the IETF endorses Ethnologue
codes, the form will change to n-sil-AAA, which will be a trivial switch to
make.  Another agreement was to add a language refinement to the subject
element, allowing focused searching on the language that is *described* by
a resource (as opposed to the language the resource is in).  When a
resource describes multiple languages, we would use multiple instances of
the element.

This gives us four possibilities:
|-----------------+-----------------+-----------------+-----------------|
|   ELEMENT       |   REFINEMENT    |   QUALIFICATION |   CONTENT       |
|-----------------+-----------------+-----------------+-----------------|
|   language      |                 |                 |   a string      |
|                 |                 |                 |   (language of  |
|                 |                 |                 |   the resource) |
|-----------------+-----------------+-----------------+-----------------|
|   language      |                 |   rfc1766       |   an RFC 1766   |
|                 |                 |                 |   code          |
|-----------------+-----------------+-----------------+-----------------|
|   subject       |   language      |                 |   a string (the |
|                 |                 |                 |   language      |
|                 |                 |                 |   described)    |
|-----------------+-----------------+-----------------+-----------------|
|   subject       |   language      |   rfc1766       |   an RFC 1766   |
|                 |                 |                 |   code          |
|-----------------+-----------------+-----------------+-----------------|
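
As an illustration only, the four rows can be produced with Python's
standard xml.etree.ElementTree.  The element name subject.language and the
identifier attribute follow the schema given later in this document; the
helper name language_element and the sample values are invented for this
sketch, and the OLAC namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

def language_element(described, name=None, code=None):
    """Build a language or subject.language element.

    described=True  -> subject.language (the language the resource describes)
    described=False -> language (the language the resource is in)
    """
    elem = ET.Element("subject.language" if described else "language")
    if name:
        elem.text = name               # free-text content
    if code:
        elem.set("identifier", code)   # RFC 1766 code
    return elem

# One element per row of the table above (sample values only).
rows = [
    language_element(False, name="English"),
    language_element(False, code="en"),
    language_element(True, name="Chinese"),
    language_element(True, code="x-sil-CHN"),
]
serialized = [ET.tostring(e, encoding="unicode") for e in rows]
for line in serialized:
    print(line)
```

Nothing prevents a provider from giving both a name and a code on the same
element; the table lists only the minimal forms.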




Supporting this language information would be a two-stage process for data
providers:

   1. Provide unqualified language and subject.language elements. Often the
      language name appears in the subject and/or description elements, and
      needs to be pulled out.
   2. Add new language code fields to the back-end database, populate these
      fields, and export them in the OLAC format. A table with the SIL codes
      will be provided.

To sum up, this approach is easy (since you can do nothing and conform),
non-parochial (since it uses RFC 1766), full-coverage (since it includes
Ethnologue codes), and extensible (since you can include your own scheme
with codes of the form x-SCHEME-CODE).

The XML Schema
|--------------------------------------------------------------------------|
|                                                                          |
|   <schema xmlns="http://www.w3.org/2000/10/XMLSchema"                    |
|           xmlns:olac="http://www.language-archives.org/OLAC/0.2/"        |
|           targetNamespace="http://www.language-archives.org/OLAC/0.2/"   |
|           elementFormDefault="qualified"                                 |
|           attributeFormDefault="unqualified">                            |
|                                                                          |
|     <annotation>                                                         |
|       <documentation>                                                    |
|         Schema for DC with qualifiers for language codes.                |
|         Steven Bird, 2/1/01                                              |
|         Schema validated at http://www.w3.org/2000/09/webdata/xsv        |
|         XSV 1.173.2.15.2.5/1.74.2.26 of 2001/01/15 14:18:55              |
|         Dublin Core semantics available at                               |
|         http://purl.org/DC/documents/rec-dces-19990702.htm               |
|       </documentation>                                                   |
|     </annotation>                                                        |
|                                                                          |
|     <element name="olac" type="olac:olacType"/>                          |
|                                                                          |
|     <complexType name="olacType">                                        |
|       <choice minOccurs="0" maxOccurs="unbounded">                       |
|                                                                          |
|         <!-- Unqualified Dublin Core Elements -->                        |
|                                                                          |
|         <element name="title" minOccurs="0" maxOccurs="unbounded"        |
|                  type="string"/>                                         |
|         <element name="creator" minOccurs="0" maxOccurs="unbounded"      |
|                  type="string"/>                                         |
|         <element name="subject" minOccurs="0" maxOccurs="unbounded"      |
|                  type="string"/>                                         |
|         <element name="description" minOccurs="0" maxOccurs="unbounded"  |
|                  type="string"/>                                         |
|         <element name="contributor" minOccurs="0" maxOccurs="unbounded"  |
|                  type="string"/>                                         |
|         <element name="publisher" minOccurs="0" maxOccurs="unbounded"    |
|                  type="string"/>                                         |
|         <element name="date" minOccurs="0" maxOccurs="unbounded"         |
|                  type="string"/>                                         |
|         <element name="type" minOccurs="0" maxOccurs="unbounded"         |
|                  type="string"/>                                         |
|         <element name="format" minOccurs="0" maxOccurs="unbounded"       |
|                  type="string"/>                                         |
|         <element name="identifier" minOccurs="0" maxOccurs="unbounded"   |
|                  type="string"/>                                         |
|         <element name="source" minOccurs="0" maxOccurs="unbounded"       |
|                  type="string"/>                                         |
|         <element name="relation" minOccurs="0" maxOccurs="unbounded"     |
|                  type="string"/>                                         |
|         <element name="coverage" minOccurs="0" maxOccurs="unbounded"     |
|                  type="string"/>                                         |
|         <element name="rights" minOccurs="0" maxOccurs="unbounded"       |
|                  type="string"/>                                         |
|                                                                          |
|         <!-- Qualified Dublin Core Elements -->                          |
|                                                                          |
|         <element name="language" minOccurs="0" maxOccurs="unbounded"     |
|                  type="olac:languageType"/>                              |
|         <element name="subject.language" minOccurs="0"                   |
|                  maxOccurs="unbounded" type="olac:languageType"/>        |
|                                                                          |
|       </choice>                                                          |
|     </complexType>                                                       |
|                                                                          |
|     <complexType name="languageType">                                    |
|       <simpleContent>                                                    |
|         <extension base="string">                                        |
|           <attribute name="identifier" use="optional"                    |
|                      type="olac:rfc1766"/>                               |
|         </extension>                                                     |
|       </simpleContent>                                                   |
|     </complexType>                                                       |
|                                                                          |
|     <simpleType name="rfc1766">                                          |
|       <restriction base="string">                                        |
|         <pattern value="[a-zA-Z]+(-[a-zA-Z]+(-[a-zA-Z]+)?)?"/>           |
|       </restriction>                                                     |
|     </simpleType>                                                        |
|                                                                          |
|   </schema>                                                              |
|                                                                          |
|--------------------------------------------------------------------------|




Example
|--------------------------------------------------------------------------|
|                                                                          |
|   <?xml version="1.0" encoding="UTF-8"?>                                 |
|   <olac                                                                  |
|     xmlns="http://www.language-archives.org/OLAC/0.2/"                   |
|     xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"             |
|     xsi:schemaLocation="http://www.language-archives.org/OLAC/0.2/       |
|                   http://www.language-archives.org/OLAC/olac-0.2.xsd">   |
|     <title>ECI Multilingual Text</title>                                 |
|     <type>text</type>                                                    |
|                                                                          |
|     <identifier                                                          |
|       >http://morph.ldc.upenn.edu/Catalog/LDC94T5.html</identifier>      |
|     <date>1994-01-01</date>                                              |
|     <description>Applications: information retrieval, machine            |
|   translation, language modeling</description>                           |
|     <!-- OLAC best practice -->                                          |
|     <subject.language identifier="x-sil-BLG"/>                           |
|     <!-- redundant -->                                                   |
|     <subject.language identifier="x-sil-CHN">Chinese</subject.language>  |
|     <!-- ISO 639 -->                                                     |
|     <subject.language identifier="EN">English</subject.language>         |
|     <!-- low-barrier for entry -->                                       |
|     <subject.language>Danish</subject.language>                          |
|   </olac>                                                                |
|                                                                          |
|--------------------------------------------------------------------------|
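
A service consuming such a record needs to pull out the described languages
together with whatever identifier or name each one carries.  The following
is a minimal sketch using Python's standard xml.etree.ElementTree; the
function name described_languages is invented here, and the record is an
abbreviated copy of the example above.

```python
import xml.etree.ElementTree as ET

OLAC_NS = "http://www.language-archives.org/OLAC/0.2/"

# Abbreviated copy of the example record.
record = """<olac xmlns="http://www.language-archives.org/OLAC/0.2/">
  <title>ECI Multilingual Text</title>
  <subject.language identifier="x-sil-BLG"/>
  <subject.language identifier="x-sil-CHN">Chinese</subject.language>
  <subject.language identifier="EN">English</subject.language>
  <subject.language>Danish</subject.language>
</olac>"""

def described_languages(xml_text):
    """Return (identifier, name) pairs; a missing part is None."""
    root = ET.fromstring(xml_text)
    pairs = []
    # Elements live in the OLAC namespace, so the tag must be qualified.
    for elem in root.iter("{%s}subject.language" % OLAC_NS):
        name = (elem.text or "").strip() or None
        pairs.append((elem.get("identifier"), name))
    return pairs

languages = described_languages(record)
for identifier, name in languages:
    print(identifier, name)
```

The four pairs correspond to the four levels of detail shown in the
example, from identifier only down to name only.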




Note that, as of early February 2001, XSV does not validate pattern
restrictions.
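
Until a validator enforces the pattern, a data provider may want to check
identifiers separately.  The sketch below is not part of OLAC; it assumes
the tag grammar of RFC 1766 (a primary tag followed by zero or more
hyphen-separated subtags, each of one to eight ASCII letters), and the
names RFC1766_TAG and is_rfc1766 are invented for the illustration.

```python
import re

# RFC 1766: Language-Tag = Primary-tag *( "-" Subtag ), each part 1*8ALPHA.
RFC1766_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")

def is_rfc1766(tag):
    """True if tag is syntactically an RFC 1766 language tag."""
    return RFC1766_TAG.fullmatch(tag) is not None

for candidate in ["en", "EN", "en-US", "x-sil-CHN", "x-sil-", "x_sil_CHN", ""]:
    print(repr(candidate), is_rfc1766(candidate))
```

This is a syntax check only: it does not confirm that a tag is registered
or that the three letters after x-sil- are an actual Ethnologue code.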

Recommended Best Practice

The OLAC recommended best practice for the identification of living and
recently dead languages is to use language and subject.language elements
with empty content, and with an identifier of the form x-sil-AAA where AAA
is an Ethnologue language code.
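
A provider could screen outgoing records for this practice with a check
along the following lines.  It is a sketch under stated assumptions: the
function name follows_best_practice is invented, and an Ethnologue code is
taken to be exactly three ASCII letters in either case.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: Ethnologue codes are exactly three ASCII letters.
SIL_IDENTIFIER = re.compile(r"x-sil-[A-Za-z]{3}")

def follows_best_practice(elem):
    """True if elem has empty content and an identifier of the form x-sil-AAA."""
    has_content = bool((elem.text or "").strip())
    identifier = elem.get("identifier", "")
    return not has_content and SIL_IDENTIFIER.fullmatch(identifier) is not None

samples = [
    '<subject.language identifier="x-sil-BLG"/>',
    '<subject.language identifier="x-sil-CHN">Chinese</subject.language>',
    '<subject.language identifier="EN">English</subject.language>',
    '<subject.language>Danish</subject.language>',
]
results = [follows_best_practice(ET.fromstring(s)) for s in samples]
print(results)
```

Only the first sample qualifies; the others remain valid OLAC but carry
content, a non-Ethnologue identifier, or no identifier at all.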

Mapping to Unqualified Dublin Core

   1. Drop the language refinement of the subject element and prepend
      "Language: " to the content.
   2. If there is an identifier but no content, look up the language name
      using the controlled vocabulary server to get a human-readable string,
      and make that the content.
   3. Drop the identifier attribute and append its value, parenthesized, to
      the content.
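
The three steps above can be sketched as a single function.  This is one
reading of the steps, not a reference implementation: the name lookup is
done before the "Language: " prefix is added so that the result reads
"Language: NAME (CODE)", the dictionary LANGUAGE_NAMES is a hypothetical
stand-in for the controlled vocabulary server, and the function name
to_unqualified is invented.

```python
# Hypothetical stand-in for the controlled vocabulary server.
LANGUAGE_NAMES = {"x-sil-CHN": "Chinese"}

def to_unqualified(element, content, identifier, names=LANGUAGE_NAMES):
    """Map a language or subject.language element to an
    (element, content) pair in unqualified Dublin Core."""
    content = (content or "").strip()
    # Step 2: an identifier without content gets a looked-up name.
    if identifier and not content:
        content = names.get(identifier, "")
    # Step 3: the identifier moves into the content, in parentheses.
    if identifier:
        content = ("%s (%s)" % (content, identifier)).strip()
    # Step 1: subject.language becomes subject with a "Language: " prefix.
    if element == "subject.language":
        return "subject", "Language: " + content
    return element, content

print(to_unqualified("subject.language", None, "x-sil-CHN"))
print(to_unqualified("subject.language", "Danish", None))
print(to_unqualified("language", "English", "EN"))
```

When the lookup finds no name, this sketch keeps just the parenthesized
code; a real service might prefer to flag the record instead.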

Support

Look up Ethnologue codes using the search interface at
http://www.ethnologue.com/

The following code tables are available.  See Gary Simons' paper (listed
under References) for the schemas.

   languagecodes.tab
   countrycodes.tab

References

   RFC 1766: Tags for the Identification of Languages
      http://www.ietf.org/rfc/rfc1766.txt
   RFC 3066: Tags for the Identification of Languages (obsoletes RFC 1766)
      ftp://ftp.isi.edu/in-notes/rfc3066.txt
   ISO 639-2: Codes for the Representation of Names of Languages, Part 2:
      Alpha-3 Code
      http://lcweb.loc.gov/standards/iso639-2/langhome.html
   Gary Simons (2000). Language identification in metadata descriptions of
      language archive holdings.
      http://www.ldc.upenn.edu/exploration/expl2000/papers/simons/simons.htm
   Ethnologue: Languages of the World
      http://www.sil.org/ethnologue/