[sword-devel] New module: Chinese dictionary

Christian Renz sword-devel@crosswire.org
Mon, 10 Jun 2002 13:36:19 +0800


First of all -- I discovered the Sword Project a few days ago, and I
am very impressed by the huge number of modules available. Praise God!

I am working on creating dictionaries for the Sword Project, using the
CEDICT project (a freely available Chinese dictionary; I have already
contacted the author about his copyright terms and hope he'll reply
soon). As you might know, standard Chinese (Mandarin) uses the Pinyin
system for transliteration. Therefore, I created three dictionaries,
so that words can be searched by English translation, by characters,
and by Pinyin. (This is for simplified characters; once that works,
I'll do the same for traditional characters.) I converted the GB2312
dictionary file to UTF-8, then used a Perl script to generate the
dictionary files (calling addld).
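
To illustrate the conversion step, here is a minimal sketch in Python
(it is not the Perl script itself). It assumes that each CEDICT line
has the form "characters [pin1 yin1] /gloss 1/gloss 2/" and that the
file is encoded as GB2312. The function names and the sample line are
made up for illustration.

```python
import re
from collections import defaultdict

# Assumed CEDICT line layout: characters, [pinyin], then /-separated glosses.
ENTRY_RE = re.compile(r"^(\S+)\s+\[([^\]]+)\]\s+/(.+)/\s*$")

def parse_line(line):
    """Return (characters, pinyin, [glosses]), or None for comments/bad lines."""
    if line.startswith("#"):
        return None
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    chars, pinyin, glosses = m.groups()
    return chars, pinyin, glosses.split("/")

def build_indexes(raw_bytes):
    """Decode GB2312 bytes and build the three lookup directions."""
    by_chars = defaultdict(list)
    by_pinyin = defaultdict(list)
    by_english = defaultdict(list)
    for line in raw_bytes.decode("gb2312").splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        chars, pinyin, glosses = parsed
        by_chars[chars].append((pinyin, glosses))
        by_pinyin[pinyin].append((chars, glosses))
        for gloss in glosses:
            # Every gloss becomes its own key in the English direction.
            by_english[gloss].append((chars, pinyin))
    return by_chars, by_pinyin, by_english

# Made-up sample input, encoded the way the source file is assumed to be.
sample = "中国 [Zhong1 guo2] /China/Middle Kingdom/\n".encode("gb2312")
by_chars, by_pinyin, by_english = build_indexes(sample)

# Keys and definitions would then be written out as UTF-8.
utf8_key = "中国".encode("utf-8")
```

Since every gloss becomes a separate key, the English direction can
end up with a different number of entries than the other two.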

So far, so good. The dictionary has 15000 to 20000 entries (depending
on direction) and the Pinyin and English dictionaries work well. I
still have to do some formatting (tone marks, nice layout, etc.).
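
For the tone marks, a possible approach is sketched below in Python.
It assumes the source writes tones as trailing digits ("guo2"), uses
"u:" for the letter ü, and uses 5 for the neutral tone; these are
assumptions about the input, and the function names are hypothetical.
The placement rule used is the usual one: "a" or "e" takes the mark,
in "ou" the "o" takes it, otherwise the last vowel does.

```python
# Each vowel with its four tone-marked forms (tones 1 to 4).
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}

def mark_syllable(syllable):
    """Convert one numbered syllable such as 'guo2' to 'guó'."""
    syllable = syllable.replace("u:", "ü")  # assumed spelling of ü in the source
    if not syllable or not syllable[-1].isdigit():
        return syllable
    base, tone = syllable[:-1], int(syllable[-1])
    if tone not in (1, 2, 3, 4):
        return base  # neutral tone carries no mark
    lower = base.lower()
    if "a" in lower:
        pos = lower.index("a")
    elif "e" in lower:
        pos = lower.index("e")
    elif "ou" in lower:
        pos = lower.index("o")
    else:
        vowels = [i for i, ch in enumerate(lower) if ch in TONE_MARKS]
        if not vowels:
            return base  # nothing to put a mark on
        pos = vowels[-1]
    marked = TONE_MARKS[lower[pos]][tone - 1]
    if base[pos].isupper():
        marked = marked.upper()
    return base[:pos] + marked + base[pos + 1:]

def mark_pinyin(text):
    """Convert a space-separated run of numbered syllables."""
    return " ".join(mark_syllable(s) for s in text.split())

converted = mark_pinyin("Zhong1 guo2")
```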

Now, after this lengthy introduction, on to my questions and issues
(anybody still reading?):

(general issues) 

- Has anybody come up with a utility to add more than one entry at a
  time? It should be easy to modify addld to read its input from a
  file, but I don't have the time to do the programming right now, and
  I was hoping that somebody had already done that. On my slow little
  Linux server, creating the dictionaries takes about fifty minutes --
  just because the script has to start tens of thousands of processes!

- Which ThML formatting tags are available in the Sword Project
  viewers? Is there something like a table tag? I'd like to group
  entries, e.g. by pronunciation. Also, does the big tag work? It
  would be useful to display the characters in a bigger typeface.
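
On the first point above, the input side of such a batch utility
could look roughly like the Python sketch below. It is only an
illustration of reading all entries from one file in a single process.
The tab-separated format and the function names are made up; nothing
here is an existing Sword tool or option.

```python
import io

def write_batch(entries, out):
    """Write (key, definition) pairs, one tab-separated pair per line."""
    for key, definition in entries:
        # Tabs or newlines inside a field would break this made-up format.
        if any(c in key + definition for c in "\t\n"):
            raise ValueError("field contains tab or newline: %r" % key)
        out.write(key + "\t" + definition + "\n")

def read_batch(inp):
    """Yield (key, definition) pairs; a batch tool would loop over these."""
    for line in inp:
        line = line.rstrip("\n")
        if not line:
            continue
        key, definition = line.split("\t", 1)
        yield key, definition

# Made-up sample entries, written to and read back from an in-memory file.
entries = [("China", "中国 Zhong1 guo2"), ("hello", "你好 ni3 hao3")]
buf = io.StringIO()
write_batch(entries, buf)
buf.seek(0)
round_trip = list(read_batch(buf))
```

With input like this, the per-entry work happens inside one loop
instead of one process start per entry.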

(Issues with the Windows version of the Sword Project and the Glory
Union Bible, Simplified Characters)

- There is some confusion about the character encodings used to
  display text in the Windows software. Apparently, in the Bible text
  windows the text is displayed as GB2312 rather than UTF-8. The
  search combo box uses the system font, which cannot display anything
  other than ISO 8859-1 on my machine, so I don't see the characters
  at all. (I still have to try Windows 2000, though; it might be
  better on that platform.)

  What is the situation for other platforms (Linux, Mac OS X)? Is the
  text of the Glory Union Bible displayed as GB2312 also? If yes, I
  could try to keep the definitions as UTF-8, but encode the search
  terms as GB2312.

- When looking up a term, it is displayed in the upper left corner, in
  a small, blue typeface. This is not useful for the characters
  dictionary, because the term is displayed in a Western encoding, not
  in Unicode. If I put the characters inside the definition, they are
  displayed just fine. Can the display of the search term in the
  definition window somehow be suppressed?
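
The encoding mismatch behind both points above can be reproduced in a
few lines of Python. The sketch shows one character in both encodings,
what happens when UTF-8 bytes are read as ISO 8859-1, and the mixed
scheme (search key as GB2312, definition as UTF-8). The sample term
and the encode_entry helper are made up for illustration.

```python
term = "中"  # sample search term

utf8_bytes = term.encode("utf-8")   # three bytes: E4 B8 AD
gb_bytes = term.encode("gb2312")    # two bytes: D6 D0

# UTF-8 bytes shown by a widget that assumes ISO 8859-1 turn into
# three unrelated Western characters instead of one Chinese character.
as_latin1 = utf8_bytes.decode("iso-8859-1")

def encode_entry(key, definition):
    """Mixed scheme: search key in GB2312, definition kept in UTF-8."""
    return key.encode("gb2312"), definition.encode("utf-8")

key_bytes, def_bytes = encode_entry("中", "中 zhong1 middle")
```

Note that encoding a key as GB2312 only works for characters that
GB2312 contains, so this scheme would be limited to the simplified
character set.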

Thanks for your help!

Greetings and blessings,
   Christian Renz

crenz@web42.com - http://www.web42.com/crenz/ - http://www.web42.com/

"The worst attitude of all would be the professional attitude which regards
children in the lump as a sort of raw material which we have to handle."
    -- C.S. Lewis, On Three Ways of Writing for Children