<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

  <title></title>

</head>

<body bgcolor="#ffffff" text="#000000">

My experience is with Perl and Java, but it may have some bearing here.<br>

Collation is language dependent. English, French and German each collate
their accented characters differently. In traditional Spanish collation, "ch"
is sorted as a separate letter after "c" (though this may be changing).<br>

In Java, collation uses the locale you provide and, failing that, the
program's default locale, which, unless set explicitly, is the user's locale.<br>
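A minimal sketch of that fallback, using the standard java.text.Collator. The class and method names (LocaleSort, sortFor) are made up for illustration, and the sample words in the note below are not from our data.<br>

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LocaleSort {
    // Returns a sorted copy of the titles using the collation rules of the
    // given locale. A null locale falls back to the JVM's default locale,
    // which is what the no-argument Collator.getInstance() uses.
    static List<String> sortFor(Locale locale, List<String> titles) {
        Collator collator = (locale == null)
                ? Collator.getInstance()
                : Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort(collator); // Collator implements Comparator<Object>
        return sorted;
    }
}
```

For example, sortFor(Locale.GERMAN, ...) on Zebra, &Auml;pfel, Apfel, Banane should give Apfel, &Auml;pfel, Banane, Zebra, whereas a plain String comparison would put &Auml;pfel last.<br>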

I found that the same collation logic is needed to do a binary search,
since the search has to compare keys in the same order the list was sorted in. So if
ICU is needed for sorting, then ICU will also be needed for a binary search.<br>
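To illustrate the point with the JDK's own Collator rather than ICU (a sketch only; CollatedSearch and find are hypothetical names): the comparator used to sort the list is handed to the binary search as well.<br>

```java
import java.text.Collator;
import java.util.Collections;
import java.util.List;

class CollatedSearch {
    // Binary search over a list that was sorted with the same Collator.
    // Returns the index of the key, or a negative value if it is absent.
    // If the list was sorted with a different ordering, the result is
    // not reliable.
    static int find(List<String> sortedTitles, String key, Collator collator) {
        return Collections.binarySearch(sortedTitles, key, collator);
    }
}
```

The caller is responsible for passing the very same collator (same locale, same strength) that produced the sorted list.<br>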

<br>

On a project I was on we had two fundamental requirements for a list of

40K+ international publication titles:<br>

1) For each supported locale, present the lists and sublists of

publications in the order that is appropriate for that locale.<br>

2) Provide efficient searching.<br>

<br>

To accomplish this we first had to normalize the name of each
publication. This requires knowing the language of the publication's
title so that that language's stop words can be used. (Het Dagblad and
The Podunk Times needed to sort under Dagblad and Podunk Times,
respectively, because Het and The are stop words in their languages.)
We decided that, while an English speaker might look for Het Dagblad
under "H", the publication's locale was more important. We had tried a
universal list of stop words, the union of every language's stop words,
but that did not work: LA could be the Spanish article or an
abbreviation for Los Angeles, and Die means very different things in
English and German.<br>

We zero-padded numbers, removed stop words, folded everything to a single case,
removed some punctuation, and collapsed redundant spacing. There were
other normalizations, but these are the obvious ones.<br>
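The steps above could be sketched roughly like this. Everything specific here is an assumption for illustration: the stop-word lists are tiny samples, the padding width of 8 is arbitrary, punctuation is replaced by a space rather than dropped, and stop words are removed wherever they occur in the title.<br>

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TitleNormalizer {
    // Hypothetical, tiny stop-word lists keyed by language code.
    private static final Map<String, Set<String>> STOP_WORDS = new HashMap<>();
    static {
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("the", "a", "an")));
        STOP_WORDS.put("nl", new HashSet<>(Arrays.asList("het", "de", "een")));
    }
    private static final int NUMBER_WIDTH = 8; // assumed padding width

    static String normalize(String title, String language) {
        // Only the stop words of the title's own language are applied.
        Set<String> stops =
                STOP_WORDS.getOrDefault(language, Collections.<String>emptySet());
        // Single-case the title and turn ASCII punctuation into spaces.
        String cleaned = title.toLowerCase(Locale.ROOT)
                              .replaceAll("\\p{Punct}+", " ");
        StringBuilder out = new StringBuilder();
        for (String word : cleaned.trim().split("\\s+")) {
            if (word.isEmpty() || stops.contains(word)) {
                continue;
            }
            if (word.matches("[0-9]+")) {
                // Zero-pad so that numbers compare correctly as text.
                StringBuilder padded = new StringBuilder();
                for (int i = word.length(); i < NUMBER_WIDTH; i++) {
                    padded.append('0');
                }
                word = padded.append(word).toString();
            }
            if (out.length() > 0) {
                out.append(' '); // joining with one space collapses extra spacing
            }
            out.append(word);
        }
        return out.toString();
    }
}
```

With these sample lists, normalize("Het Dagblad", "nl") gives "dagblad", while normalize("Die Zeit", "en") keeps "die" because it is not an English stop word.<br>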

We then created a text table with the normalized title and the original
title; the remaining columns were numeric sort keys, one for each supported
language.<br>
(This could also have been done with parallel tables.)<br>

This table was sorted on the normalized title, but using a simple 8-bit ASCII
(byte-order) collation.<br>
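One way the numeric sort-key columns might be produced, assuming a key is simply the title's rank under that locale's collation (that is my simplification for this sketch; SortKeyTable and ranksFor are hypothetical names):<br>

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class SortKeyTable {
    // Maps each title to its position when the titles are sorted with the
    // given locale's collation rules. Calling this once per supported
    // locale yields one numeric column per locale. Duplicate titles would
    // share the rank of their last occurrence.
    static Map<String, Integer> ranksFor(Locale locale, List<String> titles) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort(collator);
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            ranks.put(sorted.get(i), i);
        }
        return ranks;
    }
}
```

The expensive locale-aware comparison then happens only when the table is built; at display time the lists are ordered by comparing integers.<br>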

<br>

To search for an exact match, we normalized the user's input with
exactly the same rules and then did a binary search on the normalized titles.<br>
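A rough sketch of that lookup. The normalize method here is a deliberately simplified stand-in for the real rules, and String's natural ordering stands in for the byte-order collation the table was sorted with; ExactLookup and find are hypothetical names.<br>

```java
import java.util.Arrays;
import java.util.Locale;

class ExactLookup {
    // Simplified stand-in for the real normalization rules: trims,
    // lower-cases, and collapses runs of whitespace.
    static String normalize(String input) {
        return input.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // normalizedTitles must already be sorted in String.compareTo order,
    // the same ordering the search itself uses. Returns the index of the
    // match, or a negative value if there is none.
    static int find(String[] normalizedTitles, String userInput) {
        return Arrays.binarySearch(normalizedTitles, normalize(userInput));
    }
}
```

Because both the table and the query go through the same normalization and the same simple ordering, no locale-aware comparison is needed for the exact-match case.<br>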

When the user wanted to do a free-text search, we used something like
Lucene to index the titles. The normalized form was stored with each title.<br>

To sort a list of titles in the order the user expects to see, we
used the appropriate sort-key column from the table (or the default column
if the user's locale was not supported).<br>
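Sketched in Java, with hypothetical names throughout (ColumnSort, Row, and "en" as the default column are assumptions for the example), the column selection with fallback might look like this:<br>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class ColumnSort {
    static final String DEFAULT_COLUMN = "en"; // assumed default column

    static class Row {
        final String title;
        final Map<String, Integer> sortKeys; // language code -> numeric sort key

        Row(String title, Map<String, Integer> sortKeys) {
            this.title = title;
            this.sortKeys = sortKeys;
        }
    }

    // Uses the row's key for the requested language, or the default
    // column if that language is not supported. Every row is assumed to
    // carry a key for the default column.
    static int keyFor(Row row, String language) {
        Integer key = row.sortKeys.get(language);
        return key != null ? key : row.sortKeys.get(DEFAULT_COLUMN);
    }

    // Returns the original titles ordered by the chosen numeric column.
    static List<String> titlesFor(String language, List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt((Row r) -> keyFor(r, language)));
        List<String> titles = new ArrayList<>();
        for (Row row : sorted) {
            titles.add(row.title);
        }
        return titles;
    }
}
```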

<br>

We ultimately used Java to do the collation because Perl's UTF-8
support was not quite there (5.6 was the latest version at the time)
and we found that we needed ICU for some of the more specialized rules
that I have not presented here, and ICU was not supported for Perl at
the time. I don't know where Perl stands now.<br>

<br>

BTW, this is something that I could throw together in Java, if it is OK
to have some Sword tools in something other than C++.<br>

<br>

Daniel Glassey wrote:

<blockquote cite="mid30e46b3d05062215545b53addc@mail.gmail.com"

 type="cite">

  <pre wrap="">fwiw here's my opinion on what the standards should be. I definitely

agree that there should be standards.

On 22/06/05, Joachim Ansorg <a class="moz-txt-link-rfc2396E" href="mailto:nospam+sword-devel@joachim-ansorg.de">&lt;nospam+sword-devel@joachim-ansorg.de&gt;</a> wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">Hi,

I'm struggling with the unicode stuff of lexicons and lexicons in general.

Currently a frontend doesn't know whether to expect keys as utf8 or as

something else. because there's no standard defined. The same is valid of

GenBooks.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

It seems reasonable to me that all text, keys, everything in all types

of modules should be in UTF-8.

  </pre>

  <blockquote type="cite">

    <pre wrap="">Secondly, the sort order is not valid for unicode if unicode characters are

used in the entry names.

That way unicode strings like the german "a umlaut" appear in the end, but

they should be among the first entries of the list. Sorting in the frontend

moves the lexicon intro somewhere into the middle of the list and is

slow(er).

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Unicode defines collation(sorting). 

<a class="moz-txt-link-freetext" href="http://www.unicode.org/reports/tr10/">http://www.unicode.org/reports/tr10/</a>

The entries should be sorted using something that implements the

algorithm by the module creation app. ICU should do the job and

doesn't have to be linked into the runtime lib to be able to do this.

It only needs to be linked into the module creation app. The way it

collates is language specific so it should get German right.

I think perl and python should also be able to do collation so they

are another option.

  </pre>

  <blockquote type="cite">

    <pre wrap="">Thirdly, the lexicon intro is a hack, it uses a lot of prepended spaces to be

in the first place of the list.

We need to find a better solution for that.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Agreed (sorry, I don't have one offhand)

  </pre>

  <blockquote type="cite">

    <pre wrap="">I'm missing defined standards for the API and the modules. That would make

frontend development a lot easier.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Agreed,

Daniel


  </pre>

</blockquote>

</body>

</html>