[sword-devel] language/locale codes

Wed Nov 11 04:22:09 MST 2009

Chris,
Thanks very much for this.

I'm wondering about a few things I have seen.
In some languages there are ASCII equivalents for accented forms. I'm thinking we probably shouldn't use the ASCII forms.
Example:
	 Bokmål is sometimes Bokmaal and even Bokmal
(The SIL files consistently use the accented form)

The iso-639-3_Name_Index.tab gives an inverted form. If localized.txt gives the English name for a particular code rather than the native one, I think that, from an English speaker's perspective, the inverted forms would be preferable in localized.txt, since alphabetizing the names will put language families together.
For example, we have quite a few Zapotec modules with different language codes, whose language names are of the form
	XYZ Zapotec
The inverted form is
	Zapotec, XYZ

In comparing names between these files and the locales.d/xxx-utf8.conf, I think there may be some corruption in the locales.d utf8 confs.
For example, nb-utf8.conf has
	[Meta]
	Name=nb
	Description=Bokm√•l (Unicode)
(I noticed yesterday, while working on something else, that Perl may write UTF-8 in this form when ">:utf8", or its equivalent, is not used when creating a file handle for writing.)

The reason I'm looking at Bokmål is that I am fixing a problem in JSword regarding it being wrongly encoded as Bokm√•l.
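
One way this exact garbled form can arise is when the UTF-8 bytes for "å" are read back as Mac Roman. The sketch below only illustrates that mechanism. It is an assumption about the cause, not a diagnosis of how the locales.d files or JSword actually ended up with the wrong text.

```python
# Sketch: how "Bokmål" can turn into "Bokm√•l" (assumed mechanism, see note above).
original = "Bokmål"

# "å" (U+00E5) is two bytes in UTF-8: 0xC3 0xA5.
utf8_bytes = original.encode("utf-8")   # b'Bokm\xc3\xa5l'

# Read as Mac Roman, 0xC3 is "√" and 0xA5 is "•".
garbled = utf8_bytes.decode("mac_roman")
print(garbled)                          # Bokm√•l

# If nothing else has touched the text, reversing the two steps repairs it.
repaired = garbled.encode("mac_roman").decode("utf-8")
print(repaired)                         # Bokmål
```

The repair step only works while the garbled text still maps one-to-one back to the original bytes, so it is safer to fix the encoding where the file is written or read.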

On Nov 11, 2009, at 1:46 AM, Chris Little wrote:

> Some of you have noticed my commits of language/locale data, and I wanted to outline my near-term plans and long-term hopes for that work.
> 
> First, it should be borne in mind that the whole situation is vastly more complex than most of us are probably aware. Long ago, we adopted IETF's locale naming convention for identifying languages, which was defined in RFC 1766. This was obsoleted in favor of RFC 3066, which was in turn obsoleted in favor of BCP 47, which currently identifies RFCs 4646 and 4647.
> 
> RFC 4646 defines the syntax and sources of locale subtags. For the language subtag, tags are defined by (in order of preference): IANA, ISO 639-1, ISO 639-2/T, ISO 639-3, and ISO 639-5. Script subtags are defined by ISO 15924, and region (e.g. country) subtags are defined by ISO 3166-1.
> 
> I committed a bit of Perl that will grab the latest versions of all of this data from each registration authority (IANA, the Library of Congress, SIL, Unicode, and ISO), and I've got another pair of files in which I list the native names for many of the languages (copied from Xiphos' database) and list those codes actually used by Sword modules or locales. I'm in the process of writing another script that will parse through all of the files and output a big, easy-to-parse, master database of locale data.
> 
> From that, we can generate up-to-date data for each front end in the format that it needs and according to the desires of each team. So if the BibleTime and Bible Desktop teams only want to ship locale data for modules that are already shipped, we can filter everything else out. If a team were to prefer the English names to native names, we could produce that.

My preference is to use a hierarchical approach.
For example, when looking up a code in a given locale xx-YY, first look for it in a file localized-xx-YY.txt, where xx is the language code and YY is the country code.
If the file does not exist or the code is not in that file, look for it in localized-xx.txt.
Failing that, look in localized.txt.
Failing that, do something graceful.
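
The lookup order above can be sketched roughly as follows. This is a minimal illustration, not JSword's or SWORD's code: the function names, the in-memory TABLES stand-in for the files, the sample entries, and the choice to return the bare code as the "graceful" last resort are all assumptions.

```python
from typing import Dict, List

# Hypothetical stand-ins for the contents of the files named above;
# a real implementation would parse them from disk. Entries are samples only.
TABLES: Dict[str, Dict[str, str]] = {
    "localized-nb.txt": {"en": "Engelsk"},
    "localized.txt": {"en": "English", "nb": "Bokmål"},
}


def candidate_files(locale: str) -> List[str]:
    """List the files to try for a locale, most specific first."""
    parts = locale.split("-")
    names = []
    while parts:
        names.append("localized-" + "-".join(parts) + ".txt")
        parts.pop()  # drop the country, then the language
    names.append("localized.txt")  # default file, tried last
    return names


def lookup_name(code: str, locale: str,
                tables: Dict[str, Dict[str, str]] = TABLES) -> str:
    """Return the display name for a language code, falling back level by level."""
    for name in candidate_files(locale):
        table = tables.get(name)  # None when the file does not exist
        if table is not None and code in table:
            return table[code]
    return code  # assumed graceful fallback: show the raw code


print(candidate_files("nb-NO"))
print(lookup_name("en", "nb-NO"))  # found in localized-nb.txt
print(lookup_name("nb", "nb-NO"))  # only in the default file
print(lookup_name("zz", "nb-NO"))  # in no file, so the code itself is returned
```

Because a missing file and a missing entry are handled the same way, the locale-specific files can be pruned freely without breaking lookups.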

This is how JSword does it, using Java's built-in localization mechanism. For performance, the locale-specific files are pruned to the set of codes that are in SWORD modules. The default file has all the languages. That way, if a new language code is used and it is not in the localized file but is in the default, we don't have to hurry a release.

I'm glad to have the native form in the base/default file.

> 
> Longer term, it would be nice to push this functionality back into the library so that front ends can all share the benefits of updated data without having to deal with updates or the particulars and eventual changes in BCP 47. But that will obviously require API changes, which are not permitted, last I heard.

I think Troy was wanting 1.6.1 to be built from trunk and for it to be ABI compatible with 1.6.0.

In SVN, branching is cheap. The difficulty is doing the merge. You could create a branch for this new feature and work ahead of 1.6.1. Then when 1.6.1 is released, merge your work into trunk.

On another note, do you envision having a distribution mechanism for these files apart from front-end releases, such as putting them in a known place for download?

In Him,
	DM