[sword-devel] Chinese Bible search program

Joel Mawhorter sword-devel@crosswire.org
Tue, 12 Dec 2000 12:55:36 -0800


Hello everyone,

I have written to this list a few times in the past about supporting various 
languages such as Chinese, Arabic, etc. in Bible search software. I have 
decided that the best way to support some of these languages is to write 
software specifically for that purpose rather than extending a project such 
as Sword. Some of the requirements for these languages are very different 
from those of English-like languages. I am in the last year of my computer 
science undergrad and I am doing a project course. I decided to do a Chinese Bible 
program for this course. I am still in early development (all I really have 
so far is the Chinese Bible in an acceptable format and the full text index 
completed). As an aside, Chinese is very interesting to index because there 
are no spaces between words in Chinese. In addition, manual segmentation of 
Chinese into words can produce different results with different human 
segmenters (e.g. ABCD might be segmented as ABC D by one person and AB CD by 
another). As a result, most of my work so far has been researching how best to 
index Chinese. I hope to have something functional fairly soon.
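
One way to index text without committing to a particular segmentation is to 
index overlapping character pairs (bigrams) rather than words. The Python 
sketch below only illustrates that idea and is not necessarily the approach 
used in this project. The verse ids, the sample strings (reusing the ABCD 
notation from above) and the function names are all made up for the example.

```python
from collections import defaultdict

def build_bigram_index(verses):
    """Map each overlapping character pair to the set of verse ids containing it."""
    index = defaultdict(set)
    for verse_id, text in verses.items():
        for i in range(len(text) - 1):
            index[text[i:i + 2]].add(verse_id)
    return index

def search_phrase(index, verses, query):
    """Return sorted ids of verses containing `query` as a contiguous string."""
    if len(query) < 2:
        raise ValueError("this sketch only handles queries of 2+ characters")
    bigrams = [query[i:i + 2] for i in range(len(query) - 1)]
    # Candidate verses must contain every bigram of the query.
    candidates = set.intersection(*(index.get(b, set()) for b in bigrams))
    # The bigrams may occur in different places, so confirm against the text.
    return sorted(v for v in candidates if query in verses[v])

# Hypothetical data: letters stand in for Chinese characters.
verses = {"v1": "ABCD", "v2": "BCDA", "v3": "ABDC"}
index = build_bigram_index(verses)

print(search_phrase(index, verses, "BC"))   # ['v1', 'v2']
print(search_phrase(index, verses, "ABC"))  # ['v1']
```

Because every adjacent pair is indexed, a query matches whether a reader would 
segment ABCD as ABC D or as AB CD. The cost is a larger index and some false 
candidates, which the final substring check filters out.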

Troy, do you think this is something that could be brought under the umbrella 
of Crosswire?

Also, is there anyone on this list who reads Chinese who would be willing to 
assist me with suggestions, testing, etc.?

My goal is to make this program very simple (i.e. no texts other than the 
Bible, no pictures, no formatted text, etc.). However, I want to make the 
searching capability as powerful as possible. I have read a few good 
discussions on this list in the past about searching so I thought I would 
solicit some suggestions. My current plan is to implement AND, OR, NOT, 
wildcard, proximity and phrase searching. I would love to hear any 
suggestions that people might have about this. Specifically, I am unsure 
whether to implement NOT as a general operator or only AND NOT. For example, 
the former would allow a search such as "NOT (Love | Joy | Peace)" which 
would find all verses containing none of those three words. The latter 
would only allow searches such as "Love AND NOT Peace". My intent with the 
proximity operator is to allow people to search for two words which occur 
within x verses of each other. Should I also allow people to search for two 
words which occur within x words of each other? (This doesn't even really 
make much sense for Chinese, but I'm thinking ahead for other languages.) 
Also, how useful is XOR, since most people have no idea what it is and those 
who do probably know that "a XOR b" can be written as "(a AND NOT b) | (b AND 
NOT a)"?
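
To make the NOT question concrete, here is a small Python sketch that treats 
each word's hits as a set of verse numbers. It is an illustration under 
assumptions, not a design: the ten-verse "Bible", the postings and the 
function names are all invented for the example. It shows that a general NOT 
needs the set of all verses while AND NOT needs only its two operands, that 
XOR gives the same result as the AND NOT form above, and one possible reading 
of "within x verses of each other".

```python
# Hypothetical data: verses are numbered 0..9 in canonical order.
ALL_VERSES = set(range(10))
postings = {
    "love": {1, 2, 5},
    "joy": {2, 6},
    "peace": {2, 5, 9},
}

def general_not(a, universe):
    """General NOT: every verse in the universe that is not in a."""
    return universe - a

def and_not(a, b):
    """a AND NOT b: needs only the two operand sets."""
    return a - b

def xor(a, b):
    """a XOR b: verses in exactly one of the two sets."""
    return a ^ b

def within_verses(a, b, x):
    """Verses with a hit whose counterpart hit is at most x verses away."""
    hits = set()
    for va in a:
        for vb in b:
            if abs(va - vb) <= x:
                hits.update((va, vb))
    return hits

love, joy, peace = postings["love"], postings["joy"], postings["peace"]

# NOT (Love | Joy | Peace)
print(sorted(general_not(love | joy | peace, ALL_VERSES)))  # [0, 3, 4, 7, 8]
# Love AND NOT Peace
print(sorted(and_not(love, peace)))                         # [1]
# a XOR b equals (a AND NOT b) | (b AND NOT a)
print(xor(love, peace) == and_not(love, peace) | and_not(peace, love))  # True
# Joy within 1 verse of Peace
print(sorted(within_verses(joy, peace, 1)))                 # [2, 5, 6]
```

In this model the general NOT is no harder to implement than AND NOT, but a 
bare "NOT x" can return nearly the whole Bible, which may be a reason to 
restrict it in the user interface.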

Any other suggestions that people have, especially regarding searching, would 
be appreciated.

Thanks,

Joel