[sword-devel] Chinese Bible search program

David Burry sword-devel@crosswire.org
Tue, 12 Dec 2000 18:48:20 -0800

Hi Joel.

I'm a white American who spent 4 years in mainland China studying Chinese and teaching English among other "nefarious" religious activities ;o)  I have had a great interest in programming since I was 12 years old and now work as a Senior Web Software Developer for Adobe Systems in California.

I too have started an independent (non-Crosswire) Bible reading and searching program, written in Perl and using Berkeley DB for the indexing.  You can see a mostly functional web-based front end for it at http://beaver.dburry.com/cgi-perl/bible  Actually it's fully functional, except that some planned search operators (and parentheses) aren't yet supported, and there are some minor encoding/charset problems with non-English languages that I haven't yet taken the time to fix (though I've made great strides in understanding them)  ;o)  I am planning on putting Chinese language search capability into it as soon as I get a chance.  I'm happy to help you with any pointers you would like about the language.  I would also appreciate it if you would send me your full text and tell me where you got it and whether it's copyrighted (I won't be able to post it publicly on the web site of the non-profit org I volunteer for if it's copyrighted and I don't have written permission).

Chinese is a pictograph-based language, not an alphabetical one.  There are thousands of completely different Chinese characters.  In my experience both working with Chinese people and learning the language, each individual two-byte character is in fact a separate word, complete with its own meaning.  So the ABC D and AB CD cases you refer to are really compound words made up of smaller ones, and the easiest thing to do is to just treat them as 4 separate words: A, B, C and D.  An interesting thing to note is that there are no new words being formed in Chinese today, only new compound words made up of smaller existing single-character words.  Think of it as being like the words "desktop" and "cupboard" being composed of "desk" and "top" and "cup" and "board" respectively.  For instance, in Chinese there are two common words for "computer" which literally mean "calculating calculating machine" and "electric brain," composed of 3 and 2 characters respectively (ok, the first one has two different characters in it that mean the same thing, but you get the idea).  See http://beaver.dburry.com/cedict/ for a very simple grep-based dictionary.  An easy thing to do is to just support phrase searches and treat these compound words as phrases in a natural way.  That's what I'm planning on doing.
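
To make the "treat each character as a word, and compounds as phrases" idea concrete, here is a minimal sketch in Python (not the Perl/Berkeley DB code mentioned above, which is not shown in this message). The function names `build_index` and `phrase_search`, the in-memory dictionary index, and the three sample "verses" are all assumptions made up for illustration.

```python
from collections import defaultdict

def build_index(verses):
    """Map every character to a list of (verse_id, position) postings."""
    index = defaultdict(list)
    for verse_id, text in verses.items():
        for pos, ch in enumerate(text):
            index[ch].append((verse_id, pos))
    return index

def phrase_search(index, phrase):
    """Return sorted verse ids in which the characters of `phrase`
    appear consecutively and in order."""
    if not phrase:
        return []
    # Start with every occurrence of the first character...
    candidates = set(index.get(phrase[0], []))
    # ...and keep only those followed by each later character in turn.
    for offset, ch in enumerate(phrase[1:], start=1):
        following = set(index.get(ch, []))
        candidates = {(v, p) for (v, p) in candidates
                      if (v, p + offset) in following}
    return sorted({v for v, _ in candidates})

# Hypothetical sample data, not real Bible text.
sample_verses = {
    "v1": "我用电脑",      # contains the compound "电脑" (electric brain)
    "v2": "电和脑",        # same two characters, but not adjacent
    "v3": "计算机很快",    # contains the compound "计算机"
}
sample_index = build_index(sample_verses)
```

With this, `phrase_search(sample_index, "电脑")` matches only the verse where the two characters are adjacent, while a single-character query such as `"电"` matches every verse containing it. No segmentation decision (ABC D versus AB CD) ever has to be made at indexing time.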

I am planning on developing my Bible search program to eventually store and index the data in UTF-8 encoding format, and give the option for translating it out into other character sets on the fly if people need it.  The reason for this is I want to make it so that people can incorporate its output into their own web sites, which may be in other encoding formats if their main audience is from that locale.  I do have a printed reference manual full of tables and charts and English and Chinese explanations of the GB2312 encoding (commonly used in mainland China and elsewhere where simplified characters are used) but not yet for the Big5 encoding (commonly used in Taiwan and elsewhere where traditional characters are used).  This is useful if you need to parse the text looking for verse boundaries, etc.
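
As a rough illustration of "store in UTF-8, translate on the way out," here is a small Python sketch. It relies only on Python's built-in `gb2312` and `big5` codecs; the helper name `transcode` and the sample strings are assumptions for the example, and replacing unrepresentable characters with "?" is just one possible policy.

```python
def transcode(utf8_bytes, target_charset):
    """Decode stored UTF-8 bytes and re-encode them in the requested charset.

    Characters the target charset cannot represent are replaced with "?".
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode(target_charset, errors="replace")

# Hypothetical stored text: the same phrase in simplified and traditional form.
stored_simplified = "神爱世人".encode("utf-8")
stored_traditional = "神愛世人".encode("utf-8")

gb_bytes = transcode(stored_simplified, "gb2312")   # for mainland-style pages
big5_bytes = transcode(stored_traditional, "big5")  # for Taiwan-style pages
```

Note that transcoding only changes the byte encoding; it does not convert simplified characters to traditional ones or vice versa, so each script still needs its own source text.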

For general searching and indexing issues I highly recommend the book "Managing Gigabytes," by Witten, Moffat and Bell.


At 12:55 PM 12/12/2000 -0800, Joel Mawhorter wrote:
>Hello everyone,
>I have written to this list a few times in the past about supporting various 
>languages such as Chinese, Arabic, etc. in Bible search software. I have 
>decided that the best way to support some of these languages is to write 
>software specifically for that purpose rather than extending a project such 
>as Sword. Some of the requirements for these languages are very different 
>than for English-like languages. I am in my last year of my computer science 
>undergrad and I am doing a project course. I decided to do a Chinese Bible 
>program for this course. I am still in early development (all I really have 
>so far is the Chinese Bible in an acceptable format and the full text index 
>completed). As an aside, Chinese is very interesting to index because there 
>are no spaces between words in Chinese. As well, manual segmentation of 
>Chinese into words can produce different results with different human 
>segmentors (i.e. ABCD might be segmented ABC D by one person and AB CD by 
>another). As a result most of my work so far has been researching how best to 
>index Chinese. I hope to have something functional fairly soon.
>Troy, do you think this is something that could be brought under the umbrella 
>of Crosswire?
>Also, is there anyone on this list who reads Chinese who would be willing to 
>assist me with suggestions, testing, etc.?
>My goal is to make this program very simple (i.e. no texts other than the 
>Bible, no pictures, no formatted text, etc.). However, I want to make the 
>searching capability as powerful as possible. I have read a few good 
>discussions on this list in the past about searching so I thought I would 
>solicit some suggestions. My current plan is to implement AND, OR, NOT, 
>wildcard, proximity and phrase searching. I would love to hear any 
>suggestions that people might have about this. Specifically, I am unsure 
>whether to implement NOT as a general operator or only AND NOT. For example, 
>the former would allow a search such as "NOT (Love | Joy | Peace)" which 
>would find all verses not containing one of those three words. The latter 
>would only allow searches such as "Love AND NOT Peace". My intent with the 
>proximity operator is to allow people to search for two words which occur 
>within x verses of each other. Should I also allow people to search for two 
>words which occur within x words of each other? (This doesn't even really 
>make much sense for Chinese but I'm thinking ahead for other languages). 
>Also, how useful is XOR since most people have no idea what it is and those 
>who do probably know that "a XOR b" can be written as "(a AND NOT b) | (b AND 
>NOT a)".
>Any other suggestions that people have, especially regarding searching would 
>be appreciated.
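
On the operator questions in the quoted message: if each word's index entry is a set of verse numbers, the operators map onto set operations, and the difference between a general NOT and AND NOT becomes visible. The sketch below is in Python and is only an illustration; the function names, the tiny `postings` table and the five-verse "Bible" are invented, and the proximity helper assumes verses are numbered consecutively in reading order.

```python
# Hypothetical index: word -> set of verse numbers containing it.
ALL_VERSES = {1, 2, 3, 4, 5}
postings = {
    "love": {1, 2, 3},
    "joy": {2, 4},
    "peace": {3, 4},
}

def op_and(a, b):
    return a & b

def op_or(a, b):
    return a | b

def op_and_not(a, b):
    # "a AND NOT b" needs only the two posting sets.
    return a - b

def op_not(a, universe):
    # A general NOT needs the set of every verse to subtract from.
    return universe - a

def op_xor(a, b):
    # Same result as (a AND NOT b) | (b AND NOT a).
    return a ^ b

def within_verses(a, b, x):
    """Verses in `a` that lie within x verses of some verse in `b`."""
    return {v for v in a if any(abs(v - w) <= x for w in b)}

love, joy, peace = postings["love"], postings["joy"], postings["peace"]
none_of_three = op_not(op_or(op_or(love, joy), peace), ALL_VERSES)
love_not_peace = op_and_not(love, peace)
```

The practical cost of a general NOT is that it needs the full verse list and can return nearly the whole Bible, whereas AND NOT only ever narrows an existing result.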