[sword-devel] indexed search discrepancy

Fri Aug 28 16:12:18 MST 2009

Matthew Talbert wrote:
> TCHAR is even more ambiguous than wchar_t. If UNICODE is defined then
> TCHAR is wchar_t; otherwise, it is plain char. I'm away from my
> computer, but clucene is definitely converting to UTF-16 or UTF-32
> depending on platform, so I think it is always proper Unicode. One way
> or another, the field needs to be converted to a wchar_t containing
> UTF-16/32.

Thanks again Matthew.  Can you confirm what I think you've said above:

clucene checks the platform (maybe with something like sizeof(wchar_t))
and then converts the UTF-8 stream to either a UTF-16 or a UTF-32 encoded
stream?  This is hard for me to understand, but it is what I think you've
stated.  Here's why.

You may understand this, but just to make sure: decoding each character
of a variable-length stream like UTF-8 into a single 16-bit value does
not by itself produce UTF-16.

There are only a few choices for what lucene_utf8towc can return:
32 bits, 16 bits, or some other crazy thing.

*** 32-bits:
If lucene_utf8towc always returns a single 32-bit value to represent the
given UTF-8 character, then clucene can handle the full range of Unicode,
and we still have to investigate what lucene_utf8towcs does with the
return value from lucene_utf8towc.

*** 16-bits:
If lucene_utf8towc always returns a single value that is 16 or 32 bits
wide depending on the platform, and presuming the comment on the method
is true, we should be able to conclude that clucene cannot handle the
full range of Unicode characters on platforms that define wchar_t as
16 bits.  16 bits is not enough to represent every Unicode code point in
a single value.
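
A minimal sketch of the 16-bit case, for illustration only. The helper
names decodeUtf8 and truncateTo16 are hypothetical and are not clucene
code; the decoder assumes well-formed UTF-8 input and does no validation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not clucene code): decode the single UTF-8 sequence
// starting at p into one Unicode code point. Assumes well-formed input.
inline std::uint32_t decodeUtf8(const unsigned char *p) {
    if (p[0] < 0x80)                       // 1-byte sequence (ASCII)
        return p[0];
    if ((p[0] & 0xE0) == 0xC0)             // 2-byte sequence
        return ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
    if ((p[0] & 0xF0) == 0xE0)             // 3-byte sequence
        return ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6)
             | (p[2] & 0x3Fu);
    assert((p[0] & 0xF8) == 0xF0);         // 4-byte sequence
    return ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12)
         | ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
}

// What storing a code point in a single 16-bit value does: only the low
// 16 bits survive, so anything above U+FFFF becomes a different value.
inline std::uint16_t truncateTo16(std::uint32_t codePoint) {
    return static_cast<std::uint16_t>(codePoint);
}
```

With the bytes F0 9D 84 9E (U+1D11E), decodeUtf8 yields 0x1D11E, while
truncateTo16 of that result yields 0xD11E, which is not the same
character.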

***  some other crazy thing:
If lucene_utf8towc somehow can return multiple 16-bit values to
represent a single character (not sure how it could do this AND have the
comment on the method still be true without a crazy return object
(list<wchar_t>?)) then indeed how I understand your assessment makes
sense: clucene checks the platform (maybe with something like
sizeof(wchar_t)) and then converts the UTF-8 stream to either a UTF-16 or
a UTF-32 encoded stream.
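
For reference, this is roughly what "multiple 16-bit values for a single
character" looks like in real UTF-16. It is a sketch only: appendUtf16 is
a hypothetical helper, not a clucene function, and it sidesteps the
single-return-value problem by appending to an output vector.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not clucene code): append the UTF-16 encoding of one
// code point to out and return how many 16-bit units were written (1 or 2).
inline int appendUtf16(std::uint32_t cp, std::vector<std::uint16_t> &out) {
    // Valid code points are at most U+10FFFF and exclude the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one unit holds the code point as-is.
        out.push_back(static_cast<std::uint16_t>(cp));
        return 1;
    }
    // Above U+FFFF: split the remaining 20 bits across a surrogate pair.
    cp -= 0x10000;
    out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));   // high
    out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))); // low
    return 2;
}
```

For U+1D11E this appends the pair 0xD834 0xDD1E, so a function that
returns one wchar_t per call cannot produce it on a 16-bit wchar_t
platform.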

So, just to confirm, does lucene_utf8towc really have some way of
returning multiple values for a single Unicode character on platforms
that define wchar_t as 16 bits?

Since clucene uses wchar_t, my expected conclusion would have been (***
16-bits), above: full range supported on linux, 16-bits of glyph-space
supported on windows.
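
The platform difference can be sketched as follows. This assumes the
common convention that a 2-byte wchar_t holds UTF-16 and a 4-byte wchar_t
holds UTF-32; wcharUnitsNeeded is a hypothetical helper, not something
taken from clucene, and its second parameter exists only so both cases
can be tried on one machine.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not clucene code): how many wchar_t values one code
// point needs, assuming UTF-16 for a 2-byte wchar_t and UTF-32 for a
// 4-byte wchar_t. wcharSize defaults to the size on the build platform.
inline std::size_t wcharUnitsNeeded(std::uint32_t cp,
                                    std::size_t wcharSize = sizeof(wchar_t)) {
    if (wcharSize >= 4)
        return 1;                 // every code point fits in one value
    return cp < 0x10000 ? 1 : 2;  // above U+FFFF needs a surrogate pair
}
```

Any code point for which this returns 2 is one that a
one-wchar_t-per-character conversion would lose or mangle.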

Thanks again.  Please don't rush to a computer to investigate if you're
not sure.  I also can pull the source for clucene down when I get home
tonight.

	-Troy.

> 
> On 8/28/09, Troy A. Griffitts <scribe at crosswire.org> wrote:
>> Thanks again Matthew.  Writing quick for lack of time right now.
>>
>> In general, we avoid the use of wchar_t because it is defined differently
>> on different systems, making its intended use (as a Unicode character
>> holder) at best essentially useless for anything other than UTF-16, and
>> at least confusing and ambiguous.
>>
>> I could probably look this up, but since you know where everything is in
>> clucene by now...
>>
>> What EXACTLY is TCHAR defined as (i.e. what is sizeof(TCHAR))?  Same on
>> all platforms?
>>
>> What does lucene_utf8towc return? TCHAR? wchar_t?
>>
>> What I'm trying to determine is:
>>
>> Is clucene expecting UTF-16
>> (which represents code points up to U+FFFF in a single 2-byte unit,
>> reserving the range 0xD800-0xDFFF for surrogates, and encodes everything
>> above U+FFFF as a 4-byte surrogate pair)?
>>
>> ... or is clucene just saying 16 bits of unicode glyph space is good
>> enough for government work; we're not gonna worry about the rest?
>>
>> From the prose in the definition of the method you gave, it sounds like
>> knowing the sizeof the return value for lucene_utf8towc might tell us
>> the answer.
>>
>> Thanks again for doing the legwork.
>>
>> 	-Troy.
>>
>>
>>
>>
>> Matthew Talbert wrote:
>>>>> We have methods to convert to both UTF-16 and UTF-32 in our engine,
>>>>> which don't need a fixed length buffer, so I would like to replace:
>>>>>
>>>>> lucene_utf8towcs(wcharBuffer, content, MAX_CONV_SIZE);
>>>>>
>>>>> with a call to our code, if we can nail down exactly what clucene wants
>>>>> in the resultant wcharBuffer
>>> lucene_utf8towcs calls lucene_utf8towc for every character; the
>>> comment on the function is this:
>>>
>>> /**
>>>  * lucene_utf8towc:
>>>  * @p: a pointer to Unicode character encoded as UTF-8
>>>  *
>>>  * Converts a sequence of bytes encoded as UTF-8 to a Unicode character.
>>>  * If @p does not point to a valid UTF-8 encoded character, results are
>>>  * undefined. If you are not sure that the bytes are complete
>>>  * valid Unicode characters, you should use lucene_utf8towc_validated()
>>>  * instead.
>>>  *
>>>  * Return value: the resulting character
>>>  **/
>>>
>>> The call to doc->Add actually expects a TCHAR, so if your utf8 to
>>> utf16 conversion can produce a TCHAR, then that's all that would be
>>> necessary I think.
>>>
>>> Matthew
>>>
>>> _______________________________________________
>>> sword-devel mailing list: sword-devel at crosswire.org
>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>> Instructions to unsubscribe/change your settings at above page