[sword-devel] Thai and Lucene

Adrian Korten adrian_korten at sil.org
Mon Feb 14 19:45:52 MST 2005


I've been wondering whether Thai would benefit from Lucene. Even if it 
does support UTF-8, I doubt that Lucene can index Thai, which is written 
without spaces between words, unless word breaks are provided. Even if it 
had the smarts to handle Thai word-breaking the way ICU does, it would 
stumble over the Biblical words. So I haven't tried it.
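
To illustrate the worry about Biblical words, here is a minimal sketch of 
dictionary-driven word breaking using greedy longest match. It is a toy, 
not what ICU or Lucene actually do internally: the three-word list and the 
function name `segment` are made up for the example.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation; unknown characters are emitted singly."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                break
        else:
            candidate = text[i]  # no dictionary word starts at this position
        tokens.append(candidate)
        i += len(candidate)
    return tokens


# Hypothetical toy word list: "God", "love", "world".
TOY_WORDS = {"พระเจ้า", "รัก", "โลก"}

known = segment("พระเจ้ารักโลก", TOY_WORDS)   # every word is in the list
unknown = segment("เยรูซาเล็ม", TOY_WORDS)     # "Jerusalem" is not in the list
```

With this toy list the first phrase comes back as three words, while the 
name falls apart into single characters, which is useless as a search term.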

Is Lucene indexing primarily aimed at speeding up access to OSIS-encoded 
text files, or would it also work with the other formats? I've kept the 
Thai modules in GBF format to keep file sizes down and searches 
slightly faster.


Troy A. Griffitts wrote:
> Adrian,
>     This is great news!!!  I'd love to see a screenshot if you could 
> post one somewhere.  I've not succeeded in getting anything other than 
> Latin-based languages to show up on the menus in Windows 2000.
>     My guess about the characters which keep the .conf file from being 
> recognized... try adding a few newlines to the beginning of the file.  I 
> would guess that XXX[Section Name] at the beginning is just causing our 
> .conf reader to not recognize the "Section Name".
>     Bad news about the on/off and other preference items.  They are 
> pulled straight from the sword engine, and CAN be translated, but the 
> scheme we use in the Windows frontend depends on these strings remaining 
> what the sword engine is expecting when toggling.  It's a simple fix-- 
> we just need to translate and keep a mapping between what the engine 
> wants and the translation.  But it's not something that I can get in 
> before this trip.  If you have a chance, it would be great to have a bug 
> item something like: "i18n sword preferences", in our bug tracker for 
> BibleCS.
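
The mapping described above could look roughly like the sketch below. The 
engine-side values "On" and "Off", the Thai labels, and the helper names 
are assumptions made for illustration, not the actual BibleCS code.

```python
# Assumed engine-side option values mapped to translated UI labels.
ENGINE_TO_LABEL = {"On": "เปิด", "Off": "ปิด"}

# Reverse table so a clicked label can be turned back into what the engine expects.
LABEL_TO_ENGINE = {label: value for value, label in ENGINE_TO_LABEL.items()}


def label_for(engine_value):
    """Return the translated label, falling back to the engine string itself."""
    return ENGINE_TO_LABEL.get(engine_value, engine_value)


def engine_value_for(label):
    """Return the engine string for a label, falling back to the label itself."""
    return LABEL_TO_ENGINE.get(label, label)
```

The fallbacks mean an untranslated option still toggles correctly, because 
the engine keeps receiving the strings it expects.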
>     I also don't think the new Lucene indexing will support Thai 
> searching.  They have code in the Lucene engine for UTF-8, but I believe 
> it is very fresh, and it doesn't compile cleanly for me.  Hope they will 
> improve this shortly.  We do pass UTF-8 in for both indexing and 
> searching, so it might incidentally work :)
>     Thanks for the report!
>         -Troy.
> Adrian Korten wrote:
>> g'day,
>> Conf files converted to UTF-8 do work for the UI. I ran into a problem 
>> at first that prevented the files from being read properly. I'm using 
>> TECkit, a program that can quickly convert files from code-page 
>> encodings to various Unicode formats. It places three bytes at the 
>> beginning of the file to indicate that it is UTF-8 encoded (the byte 
>> order mark, hex 'EF BB BF'). MS programs add these bytes as well 
>> when saving to a Unicode text format. I had to manually remove these 
>> bytes before Sword would recognize the conf files and include the 
>> Thai language as an option. This makes it difficult for Windows users 
>> to create their own UI files, as it is not obvious what is causing the 
>> problem.
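
For anyone hitting the same problem, removing the leading 'EF BB BF' does 
not have to be done by hand. A minimal sketch, assuming the conf file has 
already been read as raw bytes; the function name `strip_utf8_bom` and the 
sample section name are made up for the example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 byte order mark (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A conf file as saved by a tool that prepends the mark.
raw = b"\xef\xbb\xbf[Meta]\nName=th\n"
cleaned = strip_utf8_bom(raw)
```

When the goal is text rather than bytes, decoding with Python's 
`utf-8-sig` codec also drops the mark if it is there.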
>> There seem to be some additional translations available for window 
>> titles, which is nice. Could you add 'on/off' as well if it has not 
>> been done yet? 'on/off' is used with the various switches available on 
>> the menu.
>> ak
>> Troy A. Griffitts wrote:
>>> Thanks for all the feedback again.
>>> Fixed GBFFootnotes filter which caused problems in ASV and others
>>> Fixed Start Each Verse On A New Line feature - it's located under 
>>> display preferences, the very first option.
>>> Changed CLucene indexing options to match those of JSword to see if 
>>> we can share indexes.  Thanks DM Smith for all the insight into the 
>>> better, non-default behaviour we now use!  Indexes are much smaller!
>>> Added a new Lucene search parameter: strong
>>> You can now specify a search using the strong: keyword, e.g.
>>> +God +love +world +strong:G123
>>> Jerry, we switched from SimpleAnalyzer to StandardAnalyzer, so maybe 
>>> the AND issue is fixed.  Let me know.  Also, if you still have 
>>> stability problems, please report them again.  I think I have them 
>>> all fixed.
>>> Bad news: you'll have to recreate your indexes.  But hopefully we're 
>>> slowly making it faster to create these.
>>> Thanks again for your time to test this stuff.  It's really 
>>> appreciated!  Keep the great feedback coming!
>>>     -Troy.
>>> http://crosswire.org/sword/ALPHAcckswwlkrfre22034820285912/alpha/sword-1.5.8pre3.zip 
>>> _______________________________________________
>>> sword-devel mailing list
>>> sword-devel at crosswire.org
>>> http://www.crosswire.org/mailman/listinfo/sword-devel