[bt-devel] Re: BibleTime

Daniel Glassey dglassey at gmail.com
Sat Dec 17 08:22:46 MST 2005


On 17/12/05, Lee Carpenter <elc at carpie.net> wrote:
> I saw that SWORD had a clucene option to the search.  Do you know which
> CLucene API it expects?  (0.8.x or 0.9.x)

Sword 1.5.8 uses 0.8.x. The stuff I am doing for svn is for 0.9.x.

> CLucene 0.9.x series claims
> that it uses UCS2 internally.  My inspection of it shows that it uses
> TCHAR which turns to wchar_t if UNICODE is defined during the build and
> a simple char otherwise.  If running Windows, wchar_t is 2 bytes and
> would essentially be UCS2.  Running on Linux, however, wchar_t is 4 bytes
> and would be UCS4.

Well, my understanding is that CLucene puts UCS2 data into the
wchar_t, so with a 4-byte wchar_t it just wastes two bytes per
character rather than actually being UCS4. I don't know if it uses
wchar functions like wcslen; those could get confused by code points
above U+FFFF. As far as I understand, you could theoretically put
UTF-8 into wchar_t if you really wanted to, but it would waste a lot
of space.
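
To illustrate what I mean by putting 16 bits of data into whatever
wchar_t is, here is a rough sketch. The helper names widen_ucs2 and
wasted_bytes_per_unit are made up for this mail and are not CLucene or
SWORD functions; I am only assuming that each 16-bit code unit is
copied into one wchar_t element.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <vector>

// Hypothetical helper (not a CLucene API): copy 16-bit code units into
// wchar_t, one unit per element, and append a terminating L'\0'.
std::vector<wchar_t> widen_ucs2(const char16_t* units, std::size_t n) {
    assert(units != nullptr || n == 0);
    std::vector<wchar_t> out;
    out.reserve(n + 1);
    for (std::size_t i = 0; i < n; ++i) {
        // With a 4-byte wchar_t the upper 16 bits of each element stay zero.
        out.push_back(static_cast<wchar_t>(units[i]));
    }
    out.push_back(L'\0');
    return out;
}

// Bytes left unused in each wchar_t element when it only holds 16 bits.
std::size_t wasted_bytes_per_unit() {
    return sizeof(wchar_t) - sizeof(char16_t);
}
```

Note that wcslen counts wchar_t elements, not bytes, so for plain
16-bit data it gives the same answer whatever the width of wchar_t is.
If surrogate pairs ever ended up in the data, each pair would count as
two elements rather than one character.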

> That is why I used the conversion functions which
> theoretically would handle either the 2-byte or 4-byte wchar_t.
>
> CLucene is working for me currently, but my language doesn't make use of
> many non-ASCII characters anyway, so I can't say at this point that it
> works correctly for wide characters.  It should work (using the
> conversion routines) unless somewhere in CLucene they make assumptions
> about the width of wchar_t.  Based on the way wchar_t is defined (or not
> defined as the case may be) they should not.
>
> If you like, I can take a look at the SWORD built-in clucene search as
> well...

If you like I can send you my patch offlist.
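
On the conversion routines: I have not seen yours, so the following is
only a sketch of what a width-independent UTF-8 to wchar_t conversion
might look like. The function name utf8_to_wide is hypothetical and not
part of SWORD or CLucene. It assumes that a 2-byte wchar_t should
receive surrogate pairs for code points above U+FFFF (a strictly UCS2
consumer could not represent those at all), and it does not reject
overlong encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode UTF-8 and store the result in wchar_t,
// whether wchar_t is 2 bytes or 4 bytes wide.
// Malformed sequences are replaced with U+FFFD.
std::wstring utf8_to_wide(const std::string& in) {
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; }
        ++i;
        for (; extra > 0; --extra) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;  // truncated or invalid continuation byte
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (cp > 0x10FFFF) cp = 0xFFFD;  // outside the Unicode range
        assert(cp <= 0x10FFFF);
        if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
            // Fits in one element: always true for a 4-byte wchar_t.
            out.push_back(static_cast<wchar_t>(cp));
        } else {
            // 2-byte wchar_t: split into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The only place the width of wchar_t matters here is the final branch,
so everything below U+10000 comes out identical on both platforms.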

Regards,
Daniel

> Daniel Glassey wrote:
> > On 16/12/05, Troy A. Griffitts <scribe at crosswire.org> wrote:
> >
> >>Hey guys,
> >>        Just a quick note.  Are you all aware that SWORD does expose clucene
> >>searching in the API?  We have an interface to query if indexes have
> >>been created, and also to ask them to be created (reporting status) if
> >>they have not been.
> >>
> >>        Also, it is my impression that clucene does not yet work correctly with
> >>wide characters (wchar_t is also different sizes on different platforms
> >>(as noted below) and does not conform to any standard).
> >
> >
> > Have you tried it out? My impression is that they are just putting 16
> > bits of data into whatever wchar_t is but I haven't tested it yet so I
> > don't know if it works.


