[sword-devel] clucene UTF-8 to wchar_t * (or TCHAR *) (was: 1.6.1 final call)

Fri Dec 25 05:43:57 MST 2009

Troy,

I don't think they have any intention of supporting multi-character
UTF-16 (that is, surrogate pairs). I did quite a bit of research on
this a while back, most of which can be found in my earlier emails on
the subject. In the end, UTF-16 restricted to single 16-bit characters
is good enough for the vast majority of our use cases.

The problem is that their query analyzers simply won't deal with
multi-character encodings. So, it will not help things at all for us
to convert to real UTF-16, because it would break the internals of
clucene.
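
To make the single-character limitation concrete, here is a rough
sketch of what real UTF-16 needs for a code point above U+FFFF, next to
what a plain assignment into one 16-bit unit produces (as in the
lucene_utf8towc code quoted below). The helper names encodeUtf16 and
truncateTo16 are made up for illustration; they are not clucene or
SWORD functions, and the sketch assumes the input is a valid code point.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not from clucene or SWORD): encode one Unicode
// code point as UTF-16 code units. Assumes cp is a valid code point.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit is enough.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Supplementary planes: split into a high/low surrogate pair.
        cp -= 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return units;
}

// Hypothetical helper: what storing the code point into a single
// 16-bit unit does. The upper bits are silently dropped.
std::uint16_t truncateTo16(std::uint32_t cp) {
    return static_cast<std::uint16_t>(cp);
}
```

For example, U+1D11E needs the two units 0xD834 0xDD1E, while the
single-unit assignment leaves 0xD11E, which is a different character.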

It is for reasons like this (and others) that Qt has its own version
of clucene, from which they have ripped out the string bits and
replaced them with their own (UTF-16) QString.

clucene also has the same problem that SWORD does: it does not handle
reading from or writing to paths containing certain Unicode characters.

We can't redefine TCHAR. TCHAR is defined as wchar_t when UNICODE is
defined. Otherwise it is char. Redefining it would cause all sorts of
havoc, on pretty much any platform.
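
As a sketch of the convention I mean (using the made-up name DemoTCHAR
so it does not clash with the real headers, and bufferBytes as a
hypothetical helper), the character type and therefore every buffer
size computed from it switches with the UNICODE macro:

```cpp
#include <cstddef>

// Illustration only: mirrors the usual convention under a hypothetical
// name. The real TCHAR comes from the platform or library headers.
#ifdef UNICODE
typedef wchar_t DemoTCHAR;
#else
typedef char DemoTCHAR;
#endif

// Hypothetical helper: bytes needed for n characters plus a terminator.
// The result depends on which branch above was taken.
constexpr std::size_t bufferBytes(std::size_t n) {
    return (n + 1) * sizeof(DemoTCHAR);
}
```

Any code compiled against the original definition would disagree with
a redefined type about these sizes, which is where the havoc comes from.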

I appreciate your desire to get this done properly, yet I have to find
some sort of solution very soon. I will probably lower the value used
by MAX_CONV_SIZE to a value that works on win32.
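
For reference, here is a small sketch of how the MAX_CONV_SIZE
expression quoted at the bottom of this thread evaluates for the two
wchar_t widths. The function name maxConvSize and its parameter are
made up for illustration; the width is passed in explicitly so both
platforms can be compared from one build.

```cpp
#include <cstddef>

// Hypothetical helper: evaluates 6536 * sizeof(wchar_t) * sizeof(wchar_t)
// with the wchar_t width (in bytes) supplied as a parameter.
constexpr std::size_t maxConvSize(std::size_t wcharBytes) {
    return 6536 * wcharBytes * wcharBytes;
}

// 2-byte wchar_t (win32): 6536 * 2 * 2 = 26144
// 4-byte wchar_t (*nix):  6536 * 4 * 4 = 104576
static_assert(maxConvSize(2) == 26144, "win32 value");
static_assert(maxConvSize(4) == 104576, "*nix value");
```

Neither value equals 1024 * 1024 (1048576), so the win32 figure is
already far smaller than that.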

Matthew

On Fri, Dec 25, 2009 at 1:59 AM, Troy A. Griffitts <scribe at crosswire.org> wrote:
> It looks like the lucene_utf8towcs method we are using from
> CLucene/config/utf8.cpp was provided by RedHat and probably had no
> intention to work on Windows:
>
> http://www.google.com/codesearch/p?hl=en#7HljlF5wh14/trunk/clucene-core-0.9.21/src/CLucene/config/utf8.cpp&q=lucene_utf8towcs&d=2
>
> We don't need to use these conversion routines if we knew exactly what
> clucene wants in a TCHAR *.  UTF-32?  We have methods in SWORD to do
> conversions without requiring a static buffer.
>
> If I knew exactly what clucene expected in the Field c-tor then I could
> convert the buffer with one of our methods and supply it to clucene.
>
> If I understand things correctly, Win32 has historically defined wchar_t
> to 16 bits because their 'w' methods take UCS-2 (Windows 2000) or UTF-16
> (later).
>
> After examining the lucene_utf8towcs method (and consequently the
> lucene_utf8towc method) impls, it looks like it can only return a single
> wchar_t for a UTF-8 encoded character.  This means that it cannot be
> proper UTF-16 for Windows (never multi-wchar_t) (unless I am missing
> something).
>
> In SWORD we never use wchar_t for this reason-- it is ambiguous.  When
> support was added to SWORD for clucene, clucene's methods took both
> wchar_t (lucene_utf8towcs) and TCHAR types.  I am not sure of the
> difference but hope they eventually become the same thing on the same
> platform.
>
> Since clucene provides its own conversion methods for us from UTF-8 to,
> presumably, whatever clucene ultimately wants, we used them so we didn't
> have to know what encoding clucene ultimately wanted.
>
> If it were up to me, I would replace all wchar_t types in clucene with
> TCHAR and define TCHAR as int32_t or equiv for all platforms, and remove
> all ambiguity.  However, that is not up to me.
>
> From my brief look at the code, I would guess that the current state of
> Unicode in clucene is thus:
>
> It supports conversion of UTF-8 to a 32-bit Unicode character stream on
> linux (and other platforms that define wchar_t to 32 bits) just fine.
>
> It will simply not work on Windows for values greater than 16-bit.
>
> My support of this conclusion is from the impl of this method:
>
> size_t lucene_utf8towc(wchar_t *pwc, const char *p, size_t n)
> {
>  int i, mask = 0;
>  int result;
>  unsigned char c = (unsigned char) *p;
>  int len=0;
>
>  UTF8_COMPUTE (c, mask, len);
>  if (len == -1)
>    return 0;
>  UTF8_GET (result, p, i, mask, len);
>
>  *pwc = result;
>  return len;
> }
>
>
> Notice that it assigns to *pwc (wchar_t) the value of result (int).
>
> Not sure what we should do about this.
>
> We can use our methods to convert UTF-8 to UTF-32 (a.k.a. UCS-4) and
> send that to clucene, which should work fine for clucene on systems that
> define wchar_t to 32-bit, but will fail miserably on Windows.
>
> Maybe we can get the clucene folks' opinion on this?  Maybe I've
> completely misunderstood the situation; otherwise, maybe we can offer to
> clean this up for them.
>
> Troy
>
> Matthew Talbert wrote:
>> OK, I am still not understanding why there is an issue, or what the
>> real cause of the issue is. However, this line I think will work:
>>
>> const unsigned int MAX_CONV_SIZE = 6536 * sizeof(wchar_t) * sizeof(wchar_t);
>>
>> If somebody can come up with an actual explanation for why there is a
>> problem, and a non-hackish solution, that would be great.
>>
>> Just for the record, wchar_t is 16 bits on win32 and 32 bits on *nix.
>> So, if I'm thinking correctly (and I won't guarantee that right now),
>> this should give the equivalent of 1024 * 1024;
>>
>> Matthew
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
>>