[sword-devel] clucene UTF-8 to wchar_t * (or TCHAR *) (was: 1.6.1 final call)

Troy A. Griffitts scribe at crosswire.org
Thu Dec 24 23:59:11 MST 2009


It looks like the lucene_utf8towcs method we are using from
CLucene/config/utf8.cpp was provided by Red Hat and was probably never
intended to work on Windows:

http://www.google.com/codesearch/p?hl=en#7HljlF5wh14/trunk/clucene-core-0.9.21/src/CLucene/config/utf8.cpp&q=lucene_utf8towcs&d=2

We wouldn't need to use these conversion routines if we knew exactly
what clucene wants in a TCHAR *.  UTF-32?  We have methods in SWORD to
do conversions without requiring a static buffer.

If I knew exactly what clucene expected in the Field c-tor then I could
convert the buffer with one of our methods and supply it to clucene.
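
To make the "no static buffer" point concrete, here is a minimal sketch
of a UTF-8 to UTF-32 conversion that returns a caller-owned vector.  The
name utf8ToUtf32 is hypothetical; this is neither SWORD's actual method
nor anything from clucene, and it assumes clucene would accept 32-bit
code points, which is exactly the open question.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not a SWORD or CLucene function): decode UTF-8
// into UTF-32 code points.  The result lives in a vector owned by the
// caller, so no static buffer is involved.  Malformed sequences become
// U+FFFD.  Overlong forms and encoded surrogates are NOT rejected here.
std::vector<uint32_t> utf8ToUtf32(const std::string &in) {
    std::vector<uint32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i++]);
        uint32_t cp = 0;
        int extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); continue; }  // invalid lead byte

        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        out.push_back(ok ? cp : 0xFFFD);
    }
    return out;
}
```

Whether the result could be handed to the Field c-tor directly still
depends on what clucene's TCHAR is on a given platform.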

If I understand things correctly, Win32 has historically defined wchar_t
as 16 bits because its 'w' methods take UCS-2 (through NT 4.0) or UTF-16
(Windows 2000 and later).

After examining the implementations of lucene_utf8towcs (and
consequently lucene_utf8towc), it looks like they can only produce a
single wchar_t for each UTF-8 encoded character.  This means the output
cannot be proper UTF-16 on Windows, since a character outside the BMP
would need two wchar_t units (unless I am missing something).
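
For reference, this is roughly what a correct UTF-16 encoder has to do
for a single code point.  It is only a sketch; encodeUtf16 is a
hypothetical name and not part of clucene or SWORD.  The point is the
second branch, which emits two 16-bit units for one character.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 units.
// Code points above U+FFFF need a surrogate pair, i.e. two 16-bit
// units, which a one-wchar_t-per-character routine cannot produce.
std::vector<uint16_t> encodeUtf16(uint32_t cp) {
    // Precondition: in range and not itself a surrogate value.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::vector<uint16_t> out;
    if (cp < 0x10000) {
        // BMP character: a single unit is enough.
        out.push_back(static_cast<uint16_t>(cp));
    } else {
        uint32_t v = cp - 0x10000;  // 20 bits remain
        out.push_back(static_cast<uint16_t>(0xD800 + (v >> 10)));    // high
        out.push_back(static_cast<uint16_t>(0xDC00 + (v & 0x3FF)));  // low
    }
    return out;
}
```

For example, U+10400 comes out as the pair 0xD801 0xDC00, so any routine
that writes exactly one 16-bit wchar_t for it has to drop information.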

In SWORD we never use wchar_t for this reason: its width is ambiguous.
When support was added to SWORD for clucene, clucene's methods took both
wchar_t (lucene_utf8towcs) and TCHAR types.  I am not sure of the
difference but hope they end up being the same thing on the same
platform.

Since clucene provides its own conversion methods for us from UTF-8 to,
presumably, whatever clucene ultimately wants, we used them so we didn't
have to know what encoding clucene ultimately wanted.

If it were up to me, I would replace all wchar_t types in clucene with
TCHAR and define TCHAR as int32_t or equivalent on all platforms,
removing all ambiguity.  However, that is not up to me.


