[sword-devel] French ligatures in Louis SÉGOND’s text

Chris Little chrislit at crosswire.org
Sun Jul 15 23:16:03 MST 2007

Leandro Guimarães Faria Corcete DUTRA wrote:
> Chris Little <chrislit at crosswire.org> writes:
>> We could change oe to oe-ligature where appropriate in Louis Segond. 
>> That would be simple enough since editions exist online that use 
>> oe-ligature correctly.
> 	Also, it is not that many words using that… cœur, sœur, mœur…
>         Is there anyone to do it already, or should I do it?

WikiSource already has a copy with oe-lig that we could use. No need to 
repeat the work.

>> However, since we won't be doing language-specific search tweaks
> 	That is not what I meant — I mean a general fix, where ligatures at
> the search box would find expanded characters, and vice‐versa.  Just like
> Google does it, with all kind of European ligatures.

There's a simplistic way to get the search behavior you suggest: 
decompose ligatures into their components as part of the strip filter 
process. That would work fine for French, I suppose, and for Latin, but 
it would return incorrect results in other languages. In Norwegian, 
ae-ligature is a letter in its own right, not related to a or e. In 
Swedish the same letter is written as a-umlaut. In Icelandic, 
oe-ligature shouldn't be decomposed to oe either.
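
To make the problem concrete, here is a minimal sketch of a 
language-keyed fold. It is not SWORD code: the table LIGATURE_FOLDS and 
the function fold_ligatures are hypothetical names, and the language 
codes are assumed to come from the module's configuration. Languages 
with no entry are left untouched, which is the behavior Norwegian would 
need.

```python
# Hypothetical per-language fold tables; illustrative only, not SWORD's API.
LIGATURE_FOLDS = {
    "fr": {"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"},
    "la": {"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"},
}


def fold_ligatures(text, lang):
    """Expand ligatures only for languages that have a fold table."""
    table = LIGATURE_FOLDS.get(lang, {})
    # str.maketrans accepts a dict of single characters mapped to strings.
    return text.translate(str.maketrans(table))


print(fold_ligatures("cœur", "fr"))  # French: ligature is expanded
print(fold_ligatures("bær", "no"))   # Norwegian: no table, text unchanged
```

The catch is that every new language needs its own table, which is 
exactly the language-specific tweaking mentioned above.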

Should umlauted letters be decomposed as well? Then a-umlaut becomes ae, 
o-umlaut becomes oe, and u-umlaut becomes ue, which works fine for 
German but, I suspect, not for many other languages. And what about 
i-umlaut and e-umlaut? And what about letters with accents? Some 
languages would simply drop the accent, others would double the letter, 
and there may be other behaviors I don't know about.
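
The two behaviors can be compared side by side in a short sketch. Both 
function names below (fold_german, strip_marks) are made up for 
illustration. The first applies the German ae/oe/ue convention; the 
second simply drops combining marks after canonical decomposition, which 
is what a language-agnostic filter would end up doing.

```python
import unicodedata

# German transliteration convention, written with escapes so the keys
# are guaranteed to be the precomposed letters.
GERMAN_FOLDS = str.maketrans({
    "\u00e4": "ae",  # a-umlaut
    "\u00f6": "oe",  # o-umlaut
    "\u00fc": "ue",  # u-umlaut
    "\u00c4": "Ae",
    "\u00d6": "Oe",
    "\u00dc": "Ue",
})


def fold_german(text):
    """Expand umlauted vowels the way German convention does."""
    # Normalize to NFC first so decomposed input also matches the table.
    return unicodedata.normalize("NFC", text).translate(GERMAN_FOLDS)


def strip_marks(text):
    """Drop all combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


word = "M\u00e4dchen"
print(fold_german(word))   # Maedchen
print(strip_marks(word))   # Madchen
```

The same input gives two different search keys, and which one is right 
depends on the language of the text.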

The only ligatures that we could safely decompose without reference to 
language are typographic ligatures, and we would never encode those as 
ligatures in the first place.

I don't know how Google does what they do. They may do language 
identification and language-specific processing of documents. But they 
have a lot more data and horsepower at their disposal than we do.

> 	In the end it is an Unicode question, I guess?

It's not a Unicode question because Unicode doesn't deal with this 
issue. The decomposition of oe-ligature to oe would be a 
language-specific detail and is not encoded in any of Unicode's data sets.
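
This can be checked against the Unicode Character Database using 
Python's standard unicodedata module (my choice for illustration; SWORD 
itself does not use it). Compatibility normalization (NFKD) expands the 
typographic fi-ligature, but leaves oe-ligature and ae-ligature alone 
because they have no decomposition mapping. The helper name 
compat_decompose is hypothetical.

```python
import unicodedata


def compat_decompose(text):
    """Apply Unicode compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", text)


FI_LIGATURE = "\ufb01"  # LATIN SMALL LIGATURE FI (typographic)
OE_LIGATURE = "\u0153"  # LATIN SMALL LIGATURE OE
AE_LIGATURE = "\u00e6"  # LATIN SMALL LETTER AE

for ch in (FI_LIGATURE, OE_LIGATURE, AE_LIGATURE):
    # decomposition() returns an empty string when no mapping exists.
    mapping = unicodedata.decomposition(ch) or "(none)"
    print(unicodedata.name(ch), "->", repr(compat_decompose(ch)), mapping)
```

Only the first character changes under NFKD; the other two come back 
unchanged.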

>> since oe-ligature basically can't be typed on French keyboards
> 	Yes, but regardless of keyboards us GNU/Linux users who love
> typography (admittedly a small subset) have it mapped and used it quite often.

I'm understandably more concerned with Windows users who would lose 

