[sword-devel] indexed search discrepancy

Matthew Talbert ransom1982 at gmail.com
Fri Aug 28 21:39:36 MST 2009


On Fri, Aug 28, 2009 at 7:12 PM, Troy A. Griffitts <scribe at crosswire.org> wrote:
>
> Matthew Talbert wrote:
>> TCHAR is even more ambiguous than wchar_t: if UNICODE is defined, then
>> TCHAR is wchar_t; otherwise, it is plain char. I'm away from my
>> computer, but clucene is definitely converting to UTF-16 or UTF-32,
>> depending on the platform, so I think it is always proper Unicode. One
>> way or another, the field needs to be converted to a wchar_t buffer
>> containing UTF-16/32.
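>>
>> The Windows headers do roughly this (a from-memory sketch, not the
>> exact tchar.h):
>>
>>     #ifdef UNICODE
>>     typedef wchar_t TCHAR;   /* 16-bit code units on Windows */
>>     #else
>>     typedef char    TCHAR;   /* plain 8-bit chars            */
>>     #endif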
>
> Thanks again Matthew.  Can you confirm what I think you've said above:
>
> clucene checks the platform (maybe with something like sizeof(wchar_t))
> and then converts the UTF-8 stream to either a UTF-16 or a UTF-32
> encoded stream?  This is hard for me to understand, but it's what I
> think you've stated.  Here's why.
>
>
> You may understand this, but just to make sure: converting a
> variable-length encoding like UTF-8 into a sequence of single 16-bit
> values is not UTF-16.
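>
> For example, U+1D11E (MUSICAL SYMBOL G CLEF) is the four UTF-8 bytes
> F0 9D 84 9E; real UTF-16 has to encode it as the surrogate pair
> D834 DD1E, and no single 16-bit value can hold it.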
>
> There are only a few choices for what lucene_utf8towc can return: 32
> bits, 16 bits, or some other crazy thing.
>
> *** 32-bits:
> If lucene_utf8towc always returns a single 32-bit value to represent the
> given UTF-8 character, then clucene can handle the full range of Unicode,
> and we still have investigation to do into what lucene_utf8towcs does
> with the return value from lucene_utf8towc.
>
> *** 16-bits:
> If lucene_utf8towc always returns either a 16-bit or a 32-bit single
> value, and presuming the comment on the method to be true, we should be
> able to conclude that clucene cannot handle the full range of Unicode
> characters on platforms that define wchar_t as 16 bits.  16 bits is not
> enough to represent every Unicode code point in a single value (code
> points run up to U+10FFFF, which needs 21 bits).
>
> *** some other crazy thing:
> If lucene_utf8towc somehow can return multiple 16-bit values to
> represent a single character (not sure how it could do this AND have
> the comment on the method still be true without a crazy return object
> (list<wchar_t>?)), then indeed the way I understand your assessment
> makes sense: clucene checks the platform (maybe with something like
> sizeof(wchar_t)) and then converts the UTF-8 stream to either a UTF-16
> or a UTF-32 encoded stream.
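>
> To produce real UTF-16, something would have to split each
> supplementary code point into a surrogate pair, along these lines
> (just a sketch of the arithmetic, not anything I've found in clucene):
>
>     /* One code point above U+FFFF becomes two 16-bit units. */
>     unsigned cp = 0x1D11E;                     /* G clef       */
>     unsigned v  = cp - 0x10000;                /* 20 bits left */
>     unsigned short hi = 0xD800 + (v >> 10);    /* 0xD834       */
>     unsigned short lo = 0xDC00 + (v & 0x3FF);  /* 0xDD1E       */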
>
> So, just to confirm: does lucene_utf8towc really have some way of
> returning multiple values for a single Unicode character on platforms
> that define wchar_t as 16 bits?
>
> Since clucene uses wchar_t, my expected conclusion would have been (***
> 16-bits), above: full range supported on Linux, 16 bits of glyph-space
> supported on Windows.
>
> Thanks again.  Please don't rush to a computer to investigate if you're
> not sure.  I also can pull the source for clucene down when I get home
> tonight.
>
>        -Troy.

OK, I'm still unclear on what's happening after spending time digging
through the source. The part that poses the biggest problem is that
lucene_utf8towc appears (to me) to decode a correct 32-bit value, which
it stores as an int. However, it then assigns this int directly to a
wchar_t. So if I'm understanding this correctly, whenever the Unicode
value is too big to store in 16 bits, the assignment is incorrect.
That would essentially cause data corruption, yes?
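
As far as I can tell, the pattern boils down to something like this (my
sketch of what the code appears to do, not verbatim clucene source):

    /* Suspected bug, assuming a 16-bit wchar_t as on Windows: */
    int cp = 0x1D11E;    /* code point decoded correctly from UTF-8 */
    wchar_t wc = cp;     /* a 16-bit wchar_t silently truncates this */
                         /* to 0xD11E, an unrelated BMP character    */

On Linux, where wchar_t is 32 bits, the same assignment is lossless,
which would explain a difference between the platforms.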

However, there may be more going on here than just this. For instance,
there's a "repl_wchar.h" file, a "PlatformWin32.h" file, and other
config files, all spending a good deal of code on things like _UCS2, so
it's possible that it somehow works correctly. Still, the fact that I
got different search results on Windows indicates that there is
certainly a difference. I got more results; would that indicate that
wchar_t is bigger or smaller? I think it would mean smaller.
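
A 16-bit wchar_t would also explain extra matches: distinct code points
collapse to the same truncated value, so terms that should differ
become identical in the index. A contrived illustration:

    /* Two different code points truncate to the same 16-bit value: */
    wchar_t a = (wchar_t)0x1D11E;   /* -> 0xD11E if wchar_t is 16-bit */
    wchar_t b = (wchar_t)0x2D11E;   /* -> 0xD11E as well              */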

Matthew


