[sword-devel] search failing in Hebrew modules

Troy A. Griffitts scribe at crosswire.org
Tue Aug 4 04:16:55 MST 2009


Guys,

Sorry for not being on top of this sooner.  OK, let's hammer this out. 
Karl, thanks for the data, that's great.  This is what I'm planning to 
do when I actually wake up in a few hours:

add a new tests/striptest.cpp

SWMgr library;
// Invocation: ./striptest <module> <key> <search target>
SWModule *book = library.getModule(argv[1]);
StringList filters = library.getGlobalOptions();
for (StringList::iterator it = filters.begin(); it != filters.end(); ++it) {
	// Blindly turn off all filters.  Some filters don't support "Off", but
	// that's ok, we should just silently fail on those.
	library.setGlobalOption(*it, "Off");
}
// Position the module first, so we strip the requested entry.
book->setKey(argv[2]);
// Copy each StripText() result into its own SWBuf before calling it again.
SWBuf entryStripped = book->StripText();
SWBuf targetStripped = book->StripText(argv[3]);
cout << "RawEntry:\n" << book->getRawEntry() << "\n";
cout << "StripText:\n" << entryStripped << "\n";
cout << "Search Target: " << argv[3] << "\n";
cout << "Search Target StripText: " << targetStripped << "\n";
cout << "Found: " << (strstr(entryStripped.c_str(), targetStripped.c_str()) ? "true" : "false") << endl;

and we'll try it with Karl's example data:

./striptest WLC Gen.1.9 "מתחת"

and send it to a hex display if necessary, and see what we're missing.
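
For the hex display step, something like the following could stand in
for an external tool.  This is only a sketch: hexBytes and codePoints
are hypothetical helper names, not part of SWORD, and the decoder
assumes well-formed UTF-8 and does no validation.

```cpp
#include <string>
#include <vector>

// Render each byte of s as two uppercase hex digits, space separated.
std::string hexBytes(const std::string &s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (i) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

// Decode UTF-8 into Unicode code points.
// Assumption: the input is well formed; malformed bytes are not rejected.
std::vector<unsigned long> codePoints(const std::string &s) {
    std::vector<unsigned long> out;
    std::string::size_type i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i++]);
        // Number of continuation bytes implied by the lead byte.
        int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : (lead >= 0xC0) ? 1 : 0;
        unsigned long cp = extra ? (lead & (0x3F >> extra)) : lead;
        for (; extra > 0 && i < s.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Fed the unpointed string from Karl's gdb session quoted below
("\327\236\327\252\327\227\327\252"), hexBytes gives
"D7 9E D7 AA D7 97 D7 AA" and codePoints gives U+05DE, U+05EA, U+05D7,
U+05EA, i.e. mem, tav, het, tav.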

My guess is that the root of the problem is our UTF8HebrewPoints filter 
missing something.  If this test instead prints "Found: true", then the 
problem is more likely in our case folding code.
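
To make the first guess concrete, here is a rough standalone sketch of
what stripping vowel points from UTF-8 Hebrew involves.  It is not the
UTF8HebrewPoints source.  The names isHebrewPoint and stripHebrewPoints
are hypothetical, and the set of code points treated as points
(U+05B0..U+05BD, U+05BF, U+05C1, U+05C2, U+05C7) is my assumption from
the Unicode Hebrew block.  Cantillation marks (U+0591..U+05AF) and
punctuation such as maqaf (U+05BE) are deliberately left alone.

```cpp
#include <string>

// Assumption: these are the code points we want to treat as vowel points.
bool isHebrewPoint(unsigned cp) {
    return (cp >= 0x05B0 && cp <= 0x05BD) || cp == 0x05BF ||
           cp == 0x05C1 || cp == 0x05C2 || cp == 0x05C7;
}

// Copy s, dropping every 2-byte UTF-8 sequence that encodes a point.
// Everything in U+0580..U+05FF is encoded with lead byte 0xD6 or 0xD7.
std::string stripHebrewPoints(const std::string &s) {
    std::string out;
    std::string::size_type i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        if ((lead == 0xD6 || lead == 0xD7) && i + 1 < s.size()) {
            unsigned char cont = static_cast<unsigned char>(s[i + 1]);
            unsigned cp = ((lead & 0x1Fu) << 6) | (cont & 0x3Fu);
            if (!isHebrewPoint(cp)) out.append(s, i, 2);
            i += 2;
        } else {
            out += s[i];  // not in the Hebrew range: copy the byte through
            ++i;
        }
    }
    return out;
}
```

Run on the pointed string from Karl's gdb session quoted below, this
yields exactly his unpointed bytes, "\327\236\327\252\327\227\327\252".
If the entry text and the search target do not both reduce to the same
bytes like that, a plain byte-wise strstr cannot match, so any mark the
real filter leaves behind on one side only would be enough to explain
the empty result set.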

Anyway, if someone beats me to it and tries the above test before I wake 
up, let me know the results.

Again, sorry for not being more responsive the last couple days with 
this.  This is something we really need to iron out for Hebrew and other 
languages as well.  Thanks for pushing on this issue.

	-Troy.




Karl Kleinpaste wrote:
> "Troy A. Griffitts" <scribe at crosswire.org> writes:
>> Anyone willing to put the time into investigating if proper UTF-8 is
>> being sent into the SWORD engine from the copy and paste from Xiphos?
> 
> I'll need some help here, converting octal crud from gdb to what folks
> think should be the Hebrew.
> 
> My example search is:
> - Xiphos in up-to-date F11
> - Sword at -r2437
> - WLC 1.6
> - no CLucene index
> - plain ol' multiword search 
>   (sidebar search defaults to "indexed," with fallback to multiword in
>    absence of index)
> - search scope limited to Genesis
> - copying/pasting word #5 from Gen 1:9, "מתחת"
>   (again, XEmacs is not entirely happy w/Hebrew, so I hope that appears
>    properly to the rest of you)
> 
> With vowel points off, stepping through Xiphos' acquisition of the text
> from the input box, search_string is:
> 
> $1 = 0x973f878 "\327\236\327\252\327\227\327\252"
> 
> search_string is untouched down into the Sword search call.  No results.
> 
> Turning vowel points on, but searching on the same un-vowel-pointed
> string changes nothing, I get no results.  (No surprise, but I'm trying
> to be exhaustive.)
> 
> Re-pasting the now-vowel-pointed word for search, search_string is:
> 
> $7 = 0xb9c82e8 "\327\236\326\264\327\252\326\274\326\267\327\227\326\267\327\252"
> 
> And again, no results.
> 
> Matthew says he got results in the vowel points = "on" case, but I
> don't.  The only difference I know between us is that I use Fedora and
> he uses Ubuntu, so there is perhaps some version skew on other linked
> libraries, but there is no other library in between Xiphos and Sword's
> search, so I can't explain how we get different results, when he does
> multiword, non-CLucene searches.
> 
> _______________________________________________
> sword-devel mailing list: sword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page



