[sword-devel] Chinese Bible search program

Kevin Brannen sword-devel@crosswire.org
Tue, 12 Dec 2000 20:09:35 -0600


Joel Mawhorter wrote:
> 
> Hello everyone,
> 
> I have written to this list a few times in the past about supporting various
> languages such as Chinese, Arabic, etc. in Bible search software. I have

Whooo!  Talk about taking up a challenge. :-)

> decided that the best way to support some of these languages is to write
> software specifically for that purpose rather than extending a project such
> as Sword. Some of the requirements for these languages are very different
> than for English-like languages. I am in my last year of my computer science

Please understand that I've never done I18N, but I have read about it in
conjunction with my GUI work in the past; so take this with a large volume of
salt... :-)

You have 2 basic problems from a display standpoint:  holding the data, and
"printing" the data in the proper direction.  (I am assuming the correct font
exists. :-)  In the X-Window world, and Motif specifically, the data is held
in arrays of wchar_t (wide char type), so you can put Unicode in it.  The
Motif XmText widget (via the XmString type) also has direction (i.e. left to
right or right to left); therefore, you can display Arabic, Hebrew, and so
forth in it.  I don't think top to bottom is supported. :-(  My point is,
you're going to need to find support for your I18N work in a GUI widget set of
some sort, and if you can find the proper widget to do that, your life will be
very easy from there on.  There are functions (at least in Motif) to help you
manipulate wchar_t data, so look for something like that too in whatever you
pick.
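
To make the direction problem concrete, here is a minimal sketch in Python
(not Motif) that guesses a string's base direction from its first strongly
directional character.  The function name base_direction is my own invention
for illustration; unicodedata.bidirectional is the standard-library call that
reports a character's bidirectional class.  A real widget set would do much
more than this, so treat it as a rough heuristic only.

```python
import unicodedata

def base_direction(text):
    """Guess 'rtl' or 'ltr' from the first strongly directional character.

    Hypothetical helper for illustration: 'R' and 'AL' are the bidi classes
    of Hebrew and Arabic letters, 'L' covers Latin and CJK characters.
    """
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "rtl"
        if bidi == "L":
            return "ltr"
    # No strong character found (e.g. only digits or punctuation).
    return "ltr"

hebrew = base_direction("שלום")
arabic = base_direction("سلام")
chinese = base_direction("神爱世人")
english = base_direction("love")
```

Chinese comes out as left to right here, which matches the usual horizontal
layout; top-to-bottom layout is a separate question this sketch ignores.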

...
> Also, is there anyone on this list who reads Chinese who would be willing to
> assist me with suggestions, testing, etc.

Not me, but if you get stuck and can't find anyone else, let me know and I
have a friend who might.

> 
> My goal is to make this program very simple (i.e. no texts other than the
> Bible, no pictures, no formatted text, etc.). However, I want to make the
> searching capability as powerful as possible. I have read a few good
> discussions on this list in the past about searching so I thought I would
> solicit some suggestions. My current plan is to implement AND, OR, NOT,
> wildcard, proximity and phrase searching. I would love to hear any
> suggestions that people might have about this. Specifically, I am unsure
> whether to implement NOT as a general operator or only AND NOT. For example,
> the former would allow a search such as "NOT (Love | Joy | Peace)" which
> would find all verses not containing one of those three words. The latter
> would only allow searches such as "Love AND NOT Peace". My intent with the

From Boolean algebra, you don't need NOT as a general operator over groups:
by De Morgan's laws the NOT can always be pushed down onto the individual
terms.  So you could avoid that work, and put something in Help that tells
them this rule [in case you don't know: negate each term and swap AND with
OR, so your example becomes "NOT Love & NOT Joy & NOT Peace"].  Your AND NOT
operator can't get any simpler, so if you want that functionality, you'll have
to put it in.  In a perfect world, AND NOT would be available. :-)
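
A small sketch of how this looks if each search term yields a set of verse
ids.  The index, the verse ids, and the helper names (verses_with, not_) are
all made up for illustration.  It checks De Morgan's law on the example query
and shows that AND NOT is plain set difference, while a general NOT needs the
set of all verses to take a complement against.

```python
# Hypothetical toy index: word -> set of verse ids containing it.
ALL_VERSES = {1, 2, 3, 4, 5, 6}
INDEX = {
    "love": {1, 2, 3},
    "joy": {2, 4},
    "peace": {3, 4, 5},
}

def verses_with(word):
    """Look up the verses that contain a word (case-insensitive)."""
    return INDEX.get(word.lower(), set())

def not_(hits):
    """General NOT: complement against every verse in the text."""
    return ALL_VERSES - hits

# NOT (Love | Joy | Peace)
grouped = not_(verses_with("Love") | verses_with("Joy") | verses_with("Peace"))

# NOT Love & NOT Joy & NOT Peace (De Morgan's law)
pushed_down = (not_(verses_with("Love"))
               & not_(verses_with("Joy"))
               & not_(verses_with("Peace")))

# Love AND NOT Peace needs no complement, only a set difference.
love_and_not_peace = verses_with("Love") - verses_with("Peace")
```

With AND NOT only, the search never has to build the full complement, which
is one reason it is the cheaper operator to implement.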

> proximity operator is to allow people to search for two words which occur
> within x verses of each other. Should I also allow people to search for two
> words which occur within x words of each other? (This doesn't even really
> make much sense for Chinese but I'm thinking ahead for other languages).

There have been a few times I could have used proximity. :-)  But it's probably
not worth it if it's too hard to implement.
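
For what it's worth, a rough sketch of both kinds of proximity test.  It
assumes verses are numbered sequentially through the whole text and that a
verse has already been split into tokens somehow (for Chinese that might be
single characters).  The function names are hypothetical.

```python
def within_words(tokens, a, b, x):
    """True if tokens a and b occur within x tokens of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= x for i in pos_a for j in pos_b)

def within_verses(hits_a, hits_b, x):
    """Pairs (verse_a, verse_b) whose verse numbers differ by at most x.

    hits_a and hits_b are sets of sequential verse numbers.
    """
    return sorted((va, vb)
                  for va in hits_a
                  for vb in hits_b
                  if abs(va - vb) <= x)

tokens = "for god so loved the world".split()
near = within_words(tokens, "god", "world", 4)
too_far = within_words(tokens, "god", "world", 3)
pairs = within_verses({3, 10}, {5, 20}, 2)
```

The nested loops are quadratic in the number of hits, which is fine for a
sketch but might want a sorted merge if the hit lists get long.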

> Also, how useful is XOR since most people have no idea what it is and those
> who do probably know that "a XOR b" can be written as "(a AND NOT b) | (b AND
> NOT a)".

That's a correct transformation, and no, *I* wouldn't bother implementing XOR.
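
A quick check of that transformation on sets of verse ids (the ids are made
up for illustration); Python's ^ operator on sets is symmetric difference,
which is what XOR means here.

```python
# Hypothetical hit sets for two search terms.
a = {1, 2, 3}
b = {3, 4}

# a XOR b, computed directly.
xor_direct = a ^ b

# (a AND NOT b) | (b AND NOT a), using set difference for AND NOT.
xor_rewritten = (a - b) | (b - a)
```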

Any other suggestions I would have are probably already on your list.  FWIW,
QuickVerse implements:  AND, OR, NOT, XOR, * (0 or more chars), ? (1 and only
1 char), and () for grouping.  It also does "case in/sensitivity" and "match
all word endings" (which might be nice, but is easily done with "*").
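
As a sketch of the two wildcards plus the case option, Python's standard
fnmatch module already treats * as zero or more characters and ? as exactly
one.  The wrapper name wildcard_match and the word list are mine, and note
that fnmatch also gives [...] a special meaning, which a real search syntax
might not want.

```python
from fnmatch import fnmatchcase

def wildcard_match(word, pattern, case_sensitive=False):
    """Match word against a pattern using * and ? wildcards."""
    if not case_sensitive:
        word, pattern = word.lower(), pattern.lower()
    return fnmatchcase(word, pattern)

words = ["love", "Loved", "loving", "glove", "live"]

# "Match all word endings" done with a trailing *.
endings = [w for w in words if wildcard_match(w, "lov*")]

# ? stands for exactly one character.
one_char = [w for w in words if wildcard_match(w, "l?ve")]

# Case-sensitive variant of the first search.
exact_case = [w for w in words if wildcard_match(w, "lov*", case_sensitive=True)]
```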

HTH,
Kevin