No subject

Thu Oct 29 11:12:40 MST 2009

clucene supports conversion of UTF-8 to a 32-bit Unicode character stream
on Linux (and other platforms that define wchar_t as 32 bits) just fine.

It will simply not work on Windows for code points that need more than
16 bits.

I base this conclusion on the implementation of this function:

size_t lucene_utf8towc(wchar_t *pwc, const char *p, size_t n)
{
  int i, mask = 0;
  int result;
  unsigned char c = (unsigned char) *p;
  int len=0;

  UTF8_COMPUTE (c, mask, len);
  if (len == -1)
    return 0;
  UTF8_GET (result, p, i, mask, len);

  *pwc = result;
  return len;
}

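Notice that it assigns to *pwc (wchar_t) the value of result (int).
Where wchar_t is only 16 bits wide, any decoded value above 0xFFFF
cannot fit in *pwc and loses its upper bits.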

Not sure what we should do about this.

We can use our methods to convert UTF-8 to UTF-32 (a.k.a. UCS-4) and
send that to clucene, which should work fine for clucene on systems that
define wchar_t as 32 bits, but will fail miserably on Windows.

Maybe we can get the clucene folks' opinion on this?  Maybe I've
completely misunderstood the situation; otherwise, maybe we can offer to
clean this up for them.


Matthew Talbert wrote:
> OK, I am still not understanding why there is an issue, or what the
> real cause of the issue is. However, this line I think will work:
> const unsigned int MAX_CONV_SIZE = 65536 * sizeof(wchar_t) * sizeof(wchar_t);
> If somebody can come up with an actual explanation for why there is a
> problem, and a non-hackish solution, that would be great.
> Just for the record, wchar_t is 16 bits on win32 and 32 bits on *nix.
> So, if I'm thinking correctly (and I won't guarantee that right now),
> this should give the equivalent of 1024 * 1024;
> Matthew
> _______________________________________________
> sword-devel mailing list: sword-devel at
> Instructions to unsubscribe/change your settings at above page
