[sword-devel] Chinese Bible search program

Joel Mawhorter sword-devel@crosswire.org
Wed, 13 Dec 2000 00:09:25 -0800


On Tuesday 12 December 2000 18:48, you wrote:
> Hi Joel.
>
> I'm a white American who spent 4 years in mainland China studying Chinese
> and teaching English among other "nefarious" religious activities ;o)  I
> have had a great interest in programming since I was 12 years old and now
> work as a Senior Web Software Developer for Adobe Systems in California.
>
> I too have started an independent (non crosswire) Bible reading and
> searching program written in Perl, using Berkeley DB for the indexing.  You
> can see a mostly-functional web-based front end example of it at
> http://beaver.dburry.com/cgi-perl/bible  Actually it's fully functional
> except some planned search operators (and parentheses) aren't yet supported
> and there are some minor encoding/charset problems for non-English
> languages I haven't yet taken the time to fix (though I've made great
> strides in understanding them)  ;o)  I am planning on putting Chinese
> language search capability into it as well as soon as I get a chance.  I'm
> happy to help you with any pointers you would like about the language, plus
> I would appreciate if you would send me your full text and also tell me
> where you got it from and whether it's copyrighted or not (I won't be able
> to post it publicly on this non-profit-org's web site I volunteer for if
> it's copyrighted and I don't have written permission).

I'm glad to hear you are working on something too. The more the merrier! The 
program I am writing will just be a standalone program, so it will fill a 
somewhat different need from the one you are writing. I am writing the 
program in Java because of the great Java support for Unicode and the cross 
platform support.

The full text I got was from this page: http://www2.ccim.org/~bible/dcb.html

I used the file "bible.b5" since it seemed the cleanest (a number of them had 
a lot of garbage characters in the text). There is no copyright information 
with the file, just the contact information for an organization. I contacted 
them, but no one I talked to could tell me anything about the file; I was told 
to call back later and ask for someone else. I haven't done that yet. I'll let 
you know what I find out. You can also look at www.gospelcom.net/ibs/ for a PDF 
version of the Chinese Bible that is free to download but not to 
redistribute. Have you found any texts available? Nearly every text I have 
found is of the Glory Union Version Translation (also called the Ho Ho 
Version). I talked to the local Chinese Christian group on campus and that is 
the version that they use.

> Chinese is a pictographical-based language, not an alphabetical one.  There
> are thousands of completely different Chinese characters.  In my experience
> both working with the Chinese people and learning the language, each
> individual two-byte character is in fact a separate word, complete with its
> own meaning.  Therefore the ABC D and AB CD things you refer to are really
> compound words made up of smaller ones and therefore the easiest thing to
> do is to just treat them as 4 separate words A B C and D.  An interesting
> thing to note is that there are no new words being formed in Chinese today,
> only new compound words made up of smaller existing single-character words.
>  Think of it as being like the words "desktop" and "cupboard" being
> composed of "desk" and "top" and "cup" and "board" respectively.  For
> instance, in Chinese there are two common words for "computer" which
> literally mean "calculating calculating machine" or "electric brain,"
> composed of 3 and 2 characters respectively (ok, the first one has two
> different characters in it that mean the same thing but you get the idea). 
> See http://beaver.dburry.com/cedict/ for a very simple grep-based
> dictionary.  An easy thing to do is to just support phrase searches and
> treat these compound words as phrases in a natural way.  That's what I'm
> planning on doing.

The problem with treating ABCD as separate words and doing only 
single-character indexing is that a search for BC will turn up that verse even 
if ABCD can be unambiguously separated into the compound words AB and CD. 
The search therefore returns a false positive. The problem with word 
indexing, however, is that if the separation of ABCD is ambiguous, a user may 
search for CD and be told that the term is not found because I indexed ABCD as 
the words ABC and D. Because I consider false positives to be better than 
false negatives in this case, I am going with single-character indexing. Then, 
when the user searches for AB, I will get the references from the index for A 
and B and AND the lists. After that I will have to scan the resulting set 
to find only those verses which contain A and B next to each other. This may 
prove to be too slow. If so, I will go with a hybrid character/bigram index or 
a character index with next-character information in the index.
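
As a rough illustration of the two-step lookup described above (AND the 
per-character lists, then keep only verses where the characters are adjacent), 
here is a minimal Java sketch. It is not the actual program: the class name 
CharIndex, its methods, and the use of in-memory sets are all assumptions 
made for the example, and the Latin letters stand in for Chinese characters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: a single-character inverted index with phrase verification.
final class CharIndex {
    private final List<String> verses = new ArrayList<>();
    // Maps each character (as a code point) to the sorted ids of verses containing it.
    private final Map<Integer, TreeSet<Integer>> postings = new HashMap<>();

    /** Adds a verse and returns its id (its position in insertion order). */
    int add(String text) {
        int id = verses.size();
        verses.add(text);
        text.codePoints().forEach(cp ->
            postings.computeIfAbsent(cp, k -> new TreeSet<>()).add(id));
        return id;
    }

    /** Returns ids of verses containing the query as a contiguous run of characters. */
    List<Integer> search(String query) {
        List<Integer> hits = new ArrayList<>();
        if (query.isEmpty()) {
            return hits;
        }
        // Step 1: AND the posting lists of every character in the query.
        TreeSet<Integer> candidates = null;
        for (int cp : query.codePoints().toArray()) {
            TreeSet<Integer> list = postings.get(cp);
            if (list == null) {
                return hits; // this character never occurs, so nothing can match
            }
            if (candidates == null) {
                candidates = new TreeSet<>(list);
            } else {
                candidates.retainAll(list);
            }
        }
        // Step 2: scan only the candidates and keep verses where the characters are adjacent.
        for (int id : candidates) {
            if (verses.get(id).contains(query)) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

With verses "ABCD" and "BXC" added in that order, search("BC") returns only 
the first verse: both pass the AND step, but only "ABCD" survives the 
adjacency scan. Note that "ABCD" is still reported even if it really segments 
as AB CD, which is exactly the false positive accepted above.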

> I am planning on developing my Bible search program to eventually store and
> index the data in UTF-8 encoding format, and give the option for
> translating it out into other character sets on the fly if people need it. 
> The reason for this is I want to make it so that people can incorporate its
> output into their own web sites, which may be in other encoding formats if
> their main audience is from that locale.  I do have a printed reference
> manual full of tables and charts and English and Chinese explanations of
> the GB2312 encoding (commonly used in mainland China and elsewhere where
> simplified characters are used) but not yet for the Big5 encoding (commonly
> used in Taiwan and elsewhere where traditional characters are used).  This
> is useful if you need to parse the text looking for verse boundaries, etc.

Since I am using Java, Unicode makes the most sense for me, so I converted the 
text to UTF-8. I can send that to you if you would like. 
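
For anyone wanting to do the same kind of conversion, a minimal Java sketch 
follows. It is only an illustration, not the procedure actually used here: 
the class and method names are made up, it assumes the Java runtime ships a 
"Big5" charset (common, but not guaranteed on every installation), and 
Reader.transferTo needs Java 10 or later.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for re-encoding Big5 text as UTF-8.
final class Big5ToUtf8 {
    // Throws UnsupportedCharsetException at class load if the runtime lacks Big5.
    private static final Charset BIG5 = Charset.forName("Big5");

    /** Re-encodes a Big5 byte array as UTF-8; malformed input becomes U+FFFD. */
    static byte[] convert(byte[] big5Bytes) {
        return new String(big5Bytes, BIG5).getBytes(StandardCharsets.UTF_8);
    }

    /** Streams a Big5 file into a UTF-8 file; malformed input raises an IOException. */
    static void convertFile(Path in, Path out) throws IOException {
        try (Reader reader = Files.newBufferedReader(in, BIG5);
             Writer writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            reader.transferTo(writer);
        }
    }
}
```

The file variant fails loudly on bytes that are not valid Big5, which is 
useful for spotting the garbage characters mentioned earlier rather than 
silently carrying them into the converted text.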

> For general searching and indexing issues I highly recommend the book
> "Managing Gigabytes," by Witten, Moffat and Bell.

I spent the first part of the semester reading the text index and query 
sections of that book before jumping into Chinese text indexing.

Thanks for the info.

Joel

> Dave
>
> At 12:55 PM 12/12/2000 -0800, Joel Mawhorter wrote:
> >Hello everyone,
> >
> >I have written to this list a few times in the past about supporting
> > various languages such as Chinese, Arabic, etc. in Bible search software.
> > I have decided that the best way to support some of these languages is to
> > write software specifically for that purpose rather than extending a
> > project such as Sword. Some of the requirements for these languages are
> > very different than for English-like languages. I am in my last year of
> > my computer science undergrad and I am doing a project course. I decided
> > to do a Chinese Bible program for this course. I am still in early
> > development (all I really have so far is the Chinese Bible in an
> > acceptable format and the full text index completed). As an aside,
> > Chinese is very interesting to index because there are no spaces between
> > words in Chinese. As well, manual segmentation of Chinese into words can
> > produce different results with different human segmentors (i.e. ABCD
> > might be segmented ABC D by one person and AB CD by another). As a result
> > most of my work so far has been researching how best to index Chinese. I
> > hope to have something functional fairly soon.
> >
> >Troy, do you think this is something that could be brought under the
> > umbrella of Crosswire?
> >
> >Also, is there anyone on this list who reads Chinese who would be willing
> > to assist me with suggestions, testing, etc.?
> >
> >My goal is to make this program very simple (i.e. no texts other than the
> >Bible, no pictures, no formatted text, etc.). However, I want to make the
> >searching capability as powerful as possible. I have read a few good
> >discussions on this list in the past about searching so I thought I would
> >solicit some suggestions. My current plan is to implement AND, OR, NOT,
> >wildcard, proximity and phrase searching. I would love to hear any
> >suggestions that people might have about this. Specifically, I am unsure
> >whether to implement NOT as a general operator or only AND NOT. For
> > example, the former would allow a search such as "NOT (Love | Joy |
> > Peace)" which would find all verses not containing one of those three
> > words. The latter would only allow searches such as "Love AND NOT Peace".
> > My intent with the proximity operator is to allow people to search for
> > two words which occur within x verses of each other. Should I also allow
> > people to search for two words which occur within x words of each other?
> > (This doesn't even really make much sense for Chinese but I'm thinking
> > ahead for other languages). Also, how useful is XOR since most people
> > have no idea what it is and those who do probably know that "a XOR b" can
> > be written as "(a AND NOT b) | (b AND NOT a)".
> >
> >Any other suggestions that people have, especially regarding searching
> > would be appreciated.
> >
> >Thanks,
> >
> >Joel
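
The operator question in the original post above (general NOT versus AND NOT, 
and whether XOR is worth having) can be made concrete with set operations on 
verse ids. The Java sketch below is only an illustration under assumptions: 
the class VerseSets and its method names are hypothetical, and search results 
are modelled as plain sets of integer verse ids.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: boolean search operators as operations on sets of verse ids.
final class VerseSets {
    /** Verses present in both result sets. */
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.retainAll(b);
        return r;
    }

    /** Verses present in either result set. */
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.addAll(b);
        return r;
    }

    /** "a AND NOT b": needs only the two result sets. */
    static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.removeAll(b);
        return r;
    }

    /** General NOT: needs the set of all verse ids as an extra input. */
    static Set<Integer> not(Set<Integer> allVerses, Set<Integer> a) {
        return andNot(allVerses, a);
    }

    /** XOR written as (a AND NOT b) | (b AND NOT a), as noted in the post. */
    static Set<Integer> xor(Set<Integer> a, Set<Integer> b) {
        return or(andNot(a, b), andNot(b, a));
    }
}
```

Seen this way, a general NOT is just AND NOT applied to the full list of 
verses, so supporting it mostly costs one complement against the whole Bible; 
XOR adds nothing that the other operators cannot already express.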