[bt-devel] Re: BibleTime

Sat Dec 17 20:55:24 MST 2005

Daniel Glassey wrote:
> 
> Well, my understanding is that CLucene puts UCS2 data into the
> wchar_t. So it just wastes space rather than actually being UCS4. I
> don't know if it uses wchar functions like wcslen - that would get
> confused at high codepoints. Afaiu theoretically you could put UTF-8
> into wchar if you really wanted but it would be a lot of space wasted.
> 

This is my understanding as well.  I just didn't state it clearly.  UCS2
can never carry full UCS4 information, since it has no way to represent
anything above 16 bits (code points beyond U+FFFF).  If it were UTF-16,
then there could be a "correct" conversion to UCS4, because surrogate
pairs can be recombined.  But I agree: as it stands, it's UCS2 in a
4-byte variable.  It seemed to me that CLucene's native platform was
Windows, where wchar_t is 2 bytes, and that work was done later to
support Linux.  Using wchar_t was probably the easiest way to get UCS2
support.
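
To illustrate the UCS2 vs. UTF-16 point, here is a rough sketch (my own
helper names, nothing taken from CLucene) of how a UTF-16 surrogate pair
maps to a single UCS4 code point, while a UCS2 reading just takes each
16-bit unit at face value:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only; not CLucene or Qt API.
bool is_high_surrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(std::uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// UTF-16 view: a high/low surrogate pair encodes one code point above U+FFFF.
std::uint32_t combine_surrogates(std::uint16_t hi, std::uint16_t lo) {
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000u
         + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xDC00u);
}

// UCS2 view: every 16-bit unit is treated as a code point by itself,
// so the two halves of a pair are never joined.
std::uint32_t ucs2_unit_as_code_point(std::uint16_t u) { return u; }
```

For example, U+1D11E is D834 DD1E in UTF-16; combine_surrogates(0xD834,
0xDD1E) gives 0x1D11E, whereas the UCS2 reading yields two separate
values that are not valid characters on their own.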

>>>>       Also, it is my impression that clucene does not yet work correctly with
>>>>wide characters (wchar_t is also different sizes on different platforms
>>>>(as previously below) and does not conform to any standard).
>>>
>>>
>>>Have you tried it out? My impression is that they are just putting 16
>>>bits of data into whatever wchar_t is but I haven't tested it yet so I
>>>don't know if it works.

In my testing, it works fine going from Qt's UCS-2 to UTF8 and then
from UTF8 to wchar_t.  All the data I've tested is English, so it all
fits in 8 bits per character, but nothing gets mangled or transposed.
That means I can't say whether the conversions from UTF8 to wchar_t are
correct for anything wider.  Perhaps someone who routinely uses 16-bit
(or greater) characters could test this?
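
If it helps anyone who wants to try, below is a minimal sketch of the
kind of check I have in mind.  It is NOT the conversion CLucene or
BibleTime actually uses; utf8_to_wide and check_sample are hypothetical
names, and the decoder assumes well-formed UTF8 (it does not reject
overlong or truncated sequences).  Code points above U+FFFF are replaced
with U+FFFD when wchar_t is only 16 bits, to mimic a UCS2-only store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into one wchar_t per code point.
std::wstring utf8_to_wide(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80)                   { cp = b;        extra = 0; }
        else if (b >= 0xC0 && b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b >= 0xE0 && b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else if (b >= 0xF0 && b < 0xF8) { cp = b & 0x07; extra = 3; }
        else                            { cp = 0xFFFD;   extra = 0; }  // invalid lead byte
        ++i;
        // Fold in the low 6 bits of each continuation byte.
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        // A 16-bit wchar_t cannot hold code points above U+FFFF.
        if (cp > 0xFFFF && sizeof(wchar_t) < 4) cp = 0xFFFD;
        out.push_back(static_cast<wchar_t>(cp));
    }
    return out;
}

// Sample check with non-English input: U+00E9 and U+20AC.
void check_sample() {
    const std::wstring w = utf8_to_wide("\xC3\xA9\xE2\x82\xAC");
    assert(w.size() == 2);
    assert(w[0] == 0x00E9);
    assert(w[1] == 0x20AC);
}
```

Running the same sort of comparison on text that went through the real
Qt to UTF8 to wchar_t path would show whether 2- and 3-byte UTF8
sequences survive intact.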

Lee C.