[sword-devel] French ligatures in Louis SÉGOND’s text

Mon Jul 16 05:16:52 MST 2007

On Jul 16, 2007, at 2:16 AM, Chris Little wrote:

> Leandro Guimarães Faria Corcete DUTRA wrote:
>> Chris Little <chrislit at crosswire.org> writes:
>>
>>> We could change oe to oe-ligature where appropriate in Louis Segond.
>>> That would be simple enough since editions exist online that use
>>> oe-ligature correctly.
>>
>> 	Also, there are not that many words using it: cœur, sœur, mœurs…
>>
>>         Is there anyone to do it already, or should I do it?
>
>
> WikiSource already has a copy with oe-lig that we could use. No need to
> repeat the work.
>
>>> However, since we won't be doing language-specific search tweaks
>>
>> 	That is not what I meant; I mean a general fix, where ligatures
>> typed in the search box would find expanded characters, and vice versa.
>> Just like Google does it, with all kinds of European ligatures.
>
>
> There's a simplistic solution for searching like you suggest:
> decomposing ligatures into their components as part of the strip filter
> process. That will work fine for French, I suppose, and Latin, but it
> would return incorrect results in other languages. In Norwegian,
> ae-ligature is a letter on its own, not related to a or e. In Swedish
> the same letter is written as a-umlaut. In Icelandic, oe-ligature
> shouldn't be decomposed to oe either.

I don't think our search results have to be "perfect".

Any searching of anything but the exact representation of the text  
can bring back "wrong" or "unexpected" results. Unexpected results  
also happen when the user does not have "proper" expectations.

Doesn't ICU have locale-sensitive decomposition (or transliteration)?
If it does, why can't we use the language of the module to set
the locale and then decompose? This is what we are planning to do for
JSword (it has been on the to-do list for years).
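
A minimal sketch of the idea in Python, using a hand-written table keyed by the module's language instead of ICU itself. The table contents, the language codes, and the name `fold_for_language` are assumptions made up for illustration; they are not ICU, SWORD, or JSword APIs.

```python
import unicodedata

# Hypothetical per-language folding rules (lowercase only), keyed by the
# module's language code. A real implementation would take these from
# ICU or another maintained data source.
FOLDING_RULES = {
    "fr": {"\u0153": "oe", "\u00e6": "ae"},                    # œ, æ
    "la": {"\u0153": "oe", "\u00e6": "ae"},                    # œ, æ
    "de": {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue"},    # ä, ö, ü
    # Norwegian and Icelandic treat these as letters of their own,
    # so nothing is folded for them.
    "no": {},
    "is": {},
}


def fold_for_language(text: str, lang: str) -> str:
    """Lowercase text and expand characters per the rules for lang.

    Languages without an entry are lowercased but otherwise left alone.
    """
    rules = FOLDING_RULES.get(lang, {})
    # NFC first, so precomposed characters such as ä match the table keys.
    text = unicodedata.normalize("NFC", text).lower()
    return "".join(rules.get(ch, ch) for ch in text)


if __name__ == "__main__":
    for word, lang in [("c\u0153ur", "fr"), ("H\u00e4nde", "de"), ("\u00e6re", "no")]:
        print(lang, word, "->", fold_for_language(word, lang))
```

The same function would be applied to both the indexed text and the search terms, so that either spelling finds the other in languages where that is wanted.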

Lucene can boost the importance of search terms as part of its
scoring mechanism for prioritizing search results. Boosting the
user's "as-is" terms and ORing them with normalized terms matched
against a normalized field, perhaps with a lower boost factor,
would probably give results closer to what users expect.
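
As a rough sketch, the combined query could be assembled like this. It only builds a string in Lucene's classic query syntax (`field:term^boost`); the field names `text` and `text_norm`, the boost values, and the function name are assumptions for illustration, and the terms are not escaped.

```python
def build_boosted_query(term: str, normalized_term: str,
                        exact_field: str = "text",
                        norm_field: str = "text_norm",
                        exact_boost: float = 2.0,
                        norm_boost: float = 0.5) -> str:
    """Return a Lucene query string that prefers the as-is term.

    The user's term is searched in the exact field with a higher boost,
    ORed with the normalized term in the normalized field with a lower
    boost. Field names and boosts are hypothetical defaults.
    """
    exact = f"{exact_field}:{term}^{exact_boost}"
    normalized = f"{norm_field}:{normalized_term}^{norm_boost}"
    return f"{exact} OR {normalized}"


if __name__ == "__main__":
    print(build_boosted_query("c\u0153ur", "coeur"))
```

A verse containing the exact spelling matches both clauses and ranks higher; a verse that only matches after normalization is still found, but lower in the list.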

>
> Should umlauted letters be decomposed also? So a-umlaut becomes ae,
> o-umlaut becomes oe, u-umlaut becomes ue--which works fine for German,
> but I doubt for many other languages. And what about i-umlaut and
> e-umlaut? And what about letters with accents? Some languages would
> simply drop the accent, others would double the letter, and there may be
> other behaviors I don't know about.
>
> The only ligatures that we could safely decompose without reference to
> language are typographic ligatures, and we would never encode those as
> ligatures in the first place.
>
> I don't know how Google does what they do. They may do language
> identification and language-specific processing of documents. But they
> have a lot more data and horsepower at their disposal than we do.
>
>> 	In the end it is a Unicode question, I guess?
>
> It's not a Unicode question because Unicode doesn't deal with this
> issue. The decomposition of oe-ligature to oe would be a
> language-specific detail and is not encoded in any of Unicode's data sets.
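
The normalization side of this can be checked with Python's `unicodedata` module, which exposes the Unicode Character Database. A small sketch follows; the helper name `decomposition_report` is just for illustration.

```python
import unicodedata


def decomposition_report(chars):
    """Map each character to its Unicode decomposition mapping.

    An empty string means the character has no decomposition at all.
    """
    return {ch: unicodedata.decomposition(ch) for ch in chars}


if __name__ == "__main__":
    samples = [
        "\u0153",  # œ  LATIN SMALL LIGATURE OE
        "\u00e6",  # æ  LATIN SMALL LETTER AE
        "\u00e4",  # ä  LATIN SMALL LETTER A WITH DIAERESIS
        "\ufb01",  # ﬁ  LATIN SMALL LIGATURE FI (typographic)
    ]
    for ch, mapping in decomposition_report(samples).items():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mapping!r}")

    # Compatibility normalization leaves œ alone but expands ﬁ.
    print(unicodedata.normalize("NFKD", "c\u0153ur") == "c\u0153ur")
    print(unicodedata.normalize("NFKD", "\ufb01n") == "fin")
```

The œ and æ characters have no decomposition mapping, a-umlaut decomposes only to a plus a combining diaeresis (not to ae), and the typographic fi ligature is the one that expands to its letters.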
>
>>> since oe-ligature basically can't be typed on French keyboards
>>
>> 	Yes, but regardless of keyboards, we GNU/Linux users who love
>> typography (admittedly a small subset) have it mapped and use it
>> quite often.
>
> I'm understandably more concerned with Windows users who would lose
> functionality.
>
> --Chris