[sword-devel] Thai and Lucene

Chris Little chrislit at crosswire.org
Tue Feb 15 02:28:42 MST 2005



Adrian Korten wrote:
> g'day,
> 
> I've been wondering whether Thai would benefit from Lucene. Even if it 
> does support utf-8, I doubt that Lucene supports Thai when no word 
> breaks are provided. Even if it had smarts to handle Thai word-breaking 
> like ICU, it would stumble over the Biblical words. Soooo, I haven't 
> tried it.

Hopefully someone who actually knows what Lucene indexes will answer 
this better (and especially correct me if I'm wrong), but I expect 
Lucene would benefit Thai searching somewhat because it can search 
within words, not just on full words. (By 'words' here, I'm using the 
definition of "words" in French: anything with whitespace on both sides.)

We could probably also pass text through the ICU Thai word-break 
iterator to add surrounding whitespace before we hand it to the Lucene 
indexer. Does anyone more knowledgeable know whether that would work (on 
the Lucene side)?
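
To make the pre-segmentation idea concrete, here is a rough sketch. It 
uses the JDK's java.text.BreakIterator as a stand-in, not ICU itself; 
ICU4J's com.ibm.icu.text.BreakIterator exposes the same method names. 
The class name, method name and sample phrase are made up for 
illustration, and whether a given Java runtime ships Thai dictionary 
data is an assumption (without it, few or no breaks will be found). 
Words missing from the dictionary, such as Biblical names, may still be 
split badly.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: inserts a space at every word boundary the break
// iterator reports, so a whitespace-based indexer sees separate tokens.
public class ThaiPreSegmenter {

    static String insertWordBreaks(String text) {
        // Word-break iterator for the Thai locale (JDK implementation here;
        // ICU4J's BreakIterator could be swapped in with the same calls).
        BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        words.setText(text);

        StringBuilder out = new StringBuilder();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (segment.trim().isEmpty()) {
                continue; // drop existing whitespace runs; we add our own
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            // Note: punctuation comes back as its own segment, so it is
            // also separated by spaces.
            out.append(segment);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample Thai phrase with no spaces between words.
        System.out.println(insertWordBreaks("ในปฐมกาลพระเจ้าทรงสร้างฟ้าและแผ่นดิน"));
    }
}
```

The output would be handed to the indexer in place of the raw verse 
text. The displayed text would stay untouched, so the extra spaces 
would only ever exist in the index.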

> Is Lucene indexing primarily aimed at speeding up access to OSIS coded 
> text files? Or would it also work with the other formats? I've kept the 
> Thai modules in 'gbf' format to keep the file sizes down and search 
> speeds slightly faster.

Indexing works on Bible modules, regardless of format. Commentaries 
should work too. GenBooks didn't work last I tried and I haven't tried 
any dictionaries.

--Chris


