[sword-devel] indexed search discrepancy

Matthew Talbert ransom1982 at gmail.com
Fri Aug 28 14:20:09 MST 2009


On Fri, Aug 28, 2009 at 1:38 PM, Troy A. Griffitts<scribe at crosswire.org> wrote:
> Thanks for investigating this Matthew.  There shouldn't really be any
> repercussions to increasing this within reason, though I would like to
> find a way to remove this code if we can.
>
> Does anyone know if clucene REALLY wants a wchar_t buffer, and if so,
> what EXACTLY does it want?

The call doc->add( *Field::Text(_T("key"), wcharBuffer)) expects a
wchar_t buffer, i.e. a pointer to a wchar_t string.

> wchar_t on windows is 16 bits, and on linux is typically 32 bits.
>
> This would mean that likely it expects UTF-16???  Or maybe just limits
> to 16 bit characters and doesn't support the full Unicode range (at
> least on windows)?

Actually, 16 bits would be enough for UTF-16, since it encodes anything
outside the basic plane as a pair of 16-bit units. On Linux, it would be
UTF-32, yes? clucene can be compiled with support for UTF-32/UCS-4 on
Windows as well, though I'm not quite sure how to accomplish that.
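
For anyone who wants to check what their own build gives them, here is a
small sketch in plain standard C++. Nothing in it comes from clucene or
SWORD, and both function names are made up for illustration. It reports the
encoding implied by the width of wchar_t and how many wchar_t units a single
code point would need.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: name the encoding implied by the platform's wchar_t width.
inline std::string wcharEncodingName() {
    switch (sizeof(wchar_t)) {
        case 2:  return "UTF-16";   // typical on Windows
        case 4:  return "UTF-32";   // typical on Linux
        default: return "unknown";
    }
}

// Hypothetical helper: number of wchar_t units needed for one Unicode code point.
// With a 16-bit wchar_t, anything above U+FFFF takes a surrogate pair.
inline int wcharUnitsFor(std::uint32_t codePoint) {
    if (sizeof(wchar_t) == 2 && codePoint > 0xFFFF) return 2;
    return 1;
}
```

This only describes the compiler's wchar_t; whether clucene interprets the
buffer the same way is still the question above.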

> We have methods to convert to both UTF-16 and UTF-32 in our engine,
> which don't need a fixed length buffer, so I would like to replace:
>
> lucene_utf8towcs(wcharBuffer, content, MAX_CONV_SIZE);
>
> with a call to our code, if we can nail down exactly what clucene wants
> in the resultant wcharBuffer

It needs to be something that can be cast to wchar_t[].
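
As a rough illustration of a conversion that needs no fixed-length buffer,
here is a minimal sketch. It is not SWORD's converter and not
lucene_utf8towcs; utf8ToWide is a hypothetical name, and the sketch assumes
clucene wants UTF-16 when wchar_t is 16 bits and UTF-32 when it is 32 bits,
which is exactly the point still to be confirmed.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of SWORD or clucene: decode UTF-8 into a
// std::wstring, so no fixed-size buffer (and no MAX_CONV_SIZE) is involved.
// Malformed bytes become U+FFFD; overlong forms are not rejected.
inline std::wstring utf8ToWide(const std::string &utf8) {
    std::wstring out;
    // The output never has more units than the input has bytes.
    out.reserve(utf8.size());

    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i++]);
        std::uint32_t cp;
        int extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else                            { cp = 0xFFFD;      extra = 0; }

        for (; extra > 0; --extra) {
            if (i >= utf8.size() ||
                (static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;  // truncated or invalid continuation byte
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i++]) & 0x3F);
        }
        // Out-of-range values and encoded surrogates are not valid code points.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t: emit a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

The result's c_str() could then stand in for wcharBuffer in the Field::Text
call. Since the output never holds more units than the input has bytes, a
buffer of source length + 1 wchar_t would also be enough if a raw array is
preferred.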

> Anyway, for now, upping the buffer should be fine, or dynamically
> allocating to say 2*source length should also be practically safe, but
> some of our module drivers support a 4 byte size, so retaining a static
> buffer with a fixed size would mean we'd need to make it fairly large to
> support the full range of data.

BT (BibleTime) has this buffer size set at 1024 * 1024. They also have a
call to the writer, like so:
writer->setMaxFieldLength(BT_MAX_LUCENE_FIELD_LENGTH);
with the value again being 1024 * 1024. I suspect that this call is also
necessary for some texts.


This is a separate issue, but I noticed something else about BT. They
have these two lines:
const TCHAR* stop_words[] = { NULL };
lucene::analysis::standard::StandardAnalyzer an( (const TCHAR**)stop_words );

This would be really nice to have added to SWORD. Currently, if you
search for a stop word (e.g. "the" or "is") inside Xiphos, SWORD
segfaults; adding the above lines would prevent that.

Matthew


