[sword-devel] clucene UTF-8 to wchar_t * (or TCHAR *) (was: 1.6.1 final call)
Troy A. Griffitts
scribe at crosswire.org
Thu Dec 24 23:59:11 MST 2009
It looks like the lucene_utf8towcs method we are using from
CLucene/config/utf8.cpp was provided by RedHat and was probably never
intended to work on Windows:
We wouldn't need to use these conversion routines if we knew exactly what
clucene wants in a TCHAR *. UTF-32? We have methods in SWORD to do
conversions without requiring a static buffer.
If I knew exactly what clucene expected in the Field c-tor then I could
convert the buffer with one of our methods and supply it to clucene.
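To make the "convert it ourselves" idea concrete, here is a minimal sketch
of a UTF-8 to UTF-32 decoder that returns a freshly allocated string
instead of writing into a static buffer. It is not SWORD's converter and
not CLucene's; the name utf8ToUtf32 is hypothetical, and the sketch
assumes that clucene would accept one 32-bit value per code point, which
is exactly the open question above. Malformed bytes are replaced with
U+FFFD; overlong forms and encoded surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of SWORD or CLucene): decode UTF-8 into
// UTF-32 code points. The result is returned by value, so no static
// buffer is involved.
std::u32string utf8ToUtf32(const std::string &in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(U'\uFFFD');
            ++i;
            continue;
        }

        // The whole sequence must fit in the input, and every trailing
        // byte must look like 10xxxxxx.
        bool ok = (i + extra < in.size());
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) {
            // Truncated or broken sequence: emit U+FFFD, skip one byte.
            out.push_back(U'\uFFFD');
            ++i;
            continue;
        }

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

On a platform where wchar_t is 32 bits, each element of the result could
be copied straight into a wchar_t buffer. Where wchar_t is 16 bits, that
copy would truncate every code point above U+FFFF.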
If I understand things correctly, Win32 has historically defined wchar_t
as 16 bits because their 'w' methods take UCS-2 (Windows 2000) or UTF-16.
After examining the implementations of the lucene_utf8towcs method (and
consequently the lucene_utf8towc method), it looks like they can only
return a single wchar_t for each UTF-8 encoded character. This means the
output cannot be proper UTF-16 for Windows, since a character never
becomes more than one wchar_t (unless I am missing something).
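As an illustration of why one wchar_t per character cannot be proper
UTF-16, here is a small sketch of the standard UTF-16 encoding rule. The
function name encodeUtf16 is hypothetical and is not taken from CLucene
or SWORD. It assumes the input is a valid code point (at most U+10FFFF
and not itself a surrogate).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code
// units. Assumes cp is a valid scalar value; no validation is done.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit is enough.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a high
        // (lead) surrogate and a low (trail) surrogate.
        cp -= 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return units;
}
```

For example, U+1D11E (a four-byte sequence in UTF-8) comes out as the
two units 0xD834 0xDD1E. A routine that always writes exactly one 16-bit
wchar_t per character has no way to produce that pair.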
In SWORD we never use wchar_t for this reason: it is ambiguous. When
support was added to SWORD for clucene, clucene's methods took both
wchar_t (lucene_utf8towcs) and TCHAR types. I am not sure of the
difference but hope they eventually become the same thing on the same
platform.
Since clucene provides its own conversion methods for us from UTF-8 to,
presumably, whatever clucene ultimately wants, we used them so we didn't
have to know what encoding clucene ultimately wanted.
If it were up to me, I would replace all wchar_t types in clucene with
TCHAR and define TCHAR as int32_t or equivalent for all platforms, and remove
all ambiguity. However, that is not up to me.