[sword-devel] indexed search discrepancy
ransom1982 at gmail.com
Sat Aug 29 05:33:35 MST 2009
> OK, I'm still unclear on what's happening after spending time digging
> through the source. The part that poses the biggest problem is that
> lucene_utf8towc appears (to me) to be getting a correct, 32-bit value
> which it stores as an int. However, it then assigns this int directly
> to wchar_t. So if I'm understanding this correctly, then if the
> Unicode value happens to be too big to store in 16-bits, then this
> would be incorrect. It would essentially cause data corruption, yes?
> However, there may be more going on here than just this. For instance,
> there's a "repl_wchar.h" file, a "PlatformWin32.h" file, other config
> files all spending a good deal of code on things like _UCS2. So it's
> possible that somehow it works correctly. However, the fact that I got
> different search results on Windows would indicate that there is
> certainly a difference. I got more results, so would that indicate
> that wchar_t is bigger or smaller? I think it would mean smaller.
After spending some more time on this, I believe that it is converting
to UCS2 on win32 platforms (and UCS4 where wchar_t is 32-bit).
Therefore it wouldn't handle Unicode outside of the BMP on Windows. In
addition, I don't think the analyzers can handle characters that span
more than one wchar_t (surrogate pairs), so we shouldn't try to convert
it to proper UTF-16.
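
To make the narrowing concrete, here is a minimal sketch of what I think
happens when a decoded 32-bit code point is assigned straight to a 16-bit
wchar_t. The names (wchar16, narrow_code_point, survives_narrowing) are my
own for illustration and are not taken from the CLucene source; uint16_t
stands in for a 16-bit wchar_t so the behaviour is the same on any platform.

```cpp
#include <cstdint>

// Stand-in for a 16-bit wchar_t (as on win32). Hypothetical name,
// not a CLucene type.
using wchar16 = std::uint16_t;

// Mimics assigning a decoded 32-bit code point directly to a 16-bit
// character type: the upper bits are silently dropped.
inline wchar16 narrow_code_point(std::uint32_t cp) {
    return static_cast<wchar16>(cp);
}

// True when the code point is unchanged by the narrowing, which is
// the case only for code points inside the BMP (<= U+FFFF).
inline bool survives_narrowing(std::uint32_t cp) {
    return narrow_code_point(cp) == cp;
}
```

With this, U+05D0 (inside the BMP) comes back unchanged, while U+10330
comes back as 0x0330, a different character entirely. If that is what is
going on, distinct characters outside the BMP could end up indexed as the
same BMP character.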
I have to wonder though, if we should be worrying about this
particular function and trying to optimize around it. Wouldn't moving
to something dynamically allocated affect performance negatively? (If
there's even any difference).
In addition, we should (imo) be more worried about passing correct
utf-8 to the function in the first place. There's a comment in the
(SWORD) code that casts doubt on whether that's always the case. And,
as I mentioned earlier, there are a few things that cause segfaults,
including searching for stop words ("is", "the"). The other bug that
we've had reported is that SWORD will index the module without
bothering to check whether the location is writable. After indexing,
it segfaults when it isn't writable.
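
As a rough idea of what checking the input could look like, below is a
sketch of a strict UTF-8 validity check that could be run before the text
is handed to the conversion function. is_valid_utf8 is a hypothetical
helper, not something that exists in SWORD or CLucene, and where exactly it
would be called is left open.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s is well-formed UTF-8.
// Rejects stray or missing continuation bytes, truncated sequences,
// overlong encodings, surrogates, and values above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;              // stray continuation or invalid lead byte
        if (n - i < len) return false;  // sequence cut off at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}
```

Input that fails the check could be skipped or logged instead of being
passed on, which would at least separate the bad-input cases from the
wchar_t size issue.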
I guess my vote is to just give it a big value, add the appropriate
call to the writer to increase the field size, then worry about some
of these other issues.