[bt-devel] Re: BibleTime

Sat Dec 17 21:08:15 MST 2005

Martin Gruner wrote:

>>I fought the Unicode issue again.  The search string came from QT in
>>UCS2.  CLucene uses TCHAR which is a wchar_t if built for Unicode and
>>just char if built for ANSI.  To make matters worse, wchar_t is 2 bytes
>>on Windows and 4 bytes on Linux.  Fortunately, I found some conversion
>>utilities in CLucene that allowed me to convert from utf8 to wide-char
>>strings.  So I use QString to convert to UTF8 then those utils to
>>convert to CLucene's wchar types.  Then I search my index and convert the
>>results from wchar types to utf8 to stuff back into SWKey results. *Phew*
> 
> I suggest that we _demand_ that users install clucene built for Unicode. Isn't 
> wchar_t UCS2? Perhaps we could speed up index creation if we have a direct 
> conversion routine, instead of UCS2 - UTF8 - WCHAR_T (UCS2)? I'm no expert 
> here. We could add that later, also.

As you've seen from the other posts to the list, on Linux wchar_t is 
essentially UCS4.  The only reason I have to convert from UCS2 to UCS4 is to 
handle the input string from Qt, which comes natively in UCS2.  I could 
write the routine to directly stuff UCS2 chars into 4-byte variables, 
but since it was an incredibly small amount of data, I just used the 
convenience functions that were provided.
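
For reference, a direct widening routine could look roughly like the 
sketch below.  It is not what the code does today.  The name widenUcs2 
is made up for illustration, and it assumes the input is plain UCS2 
with no surrogate pairs, so each 16-bit unit maps to exactly one 
32-bit code point.

```cpp
#include <string>

// Hypothetical helper (not part of Qt, CLucene or SWORD).
// Widens each 16-bit UCS-2 code unit to a 32-bit code point.
// Assumption: the input contains no UTF-16 surrogate pairs; a real
// UTF-16 decoder would have to combine those into one code point.
std::u32string widenUcs2(const std::u16string& in)
{
    std::u32string out;
    out.reserve(in.size());
    for (char16_t unit : in) {
        out.push_back(static_cast<char32_t>(unit));
    }
    return out;
}
```

The result could then be copied into a 4-byte wchar_t buffer on Linux.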

Since the SWORD modules are already UTF8, there is no "middle man" in 
that conversion...
> 
>>1. Search syntax.  As you know CLucene has a rich search syntax.  Do we
>>want to expose that syntax directly (i.e. the user types their query in
>>the syntax supported by CLucene) or do we want to break out the syntax
>>into user interface elements (e.g. the AND/OR/ANY buttons, etc.)?
>>
>>2. Do we want index-based searching to be "the search method" or do we
>>want it to be an option along with the search that's there now?
> 
> 
> It will be the standard and the only method. =) And IMO we should directly 
> expose the search syntax and offer some nice help for users to learn it. This 
> means that we can remove many buttons/boxes in the search dialog. Going to be 
> easier for us and more flexible for the users.

Ok, I agree.  It is a rich syntax.

>>3. Index-building.  When do we want to build the index?  It almost makes
>>sense to build the index when the user adds a module.  However, this is
>>a potentially long operation.  We could kick off a thread to do it and
>>keep the UI free for other purposes.  Also, we could do like most search
>>engines and force the user to build the index the first time they search.
> 
> 
> The last is what I'd suggest.
> Another question: Will we be able to access the index directly, e.g. getting a 
> list of all words starting or ending with XY? I have plans for an "instant 
> concordance" function later which would operate on the index.
> You could make a little blocking pop-up window that just says "(Re)building 
> index for module XY, this may take a while" and has a progress bar. No user 
> interaction needed.

Ok.  A seemingly simple way to get what you want from the index is to 
perform a CLucene search and read the returned Hits directly (as 
opposed to having them returned in SWKey lists).
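
As a rough illustration of the "words starting with XY" lookup, here 
is a sketch that does not touch the CLucene API at all.  It assumes 
the index terms have already been pulled out into a sorted list; 
termsWithPrefix is a hypothetical name.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: returns every term that starts with `prefix`.
// Assumption: `sortedTerms` is sorted in ascending byte order, so all
// matching terms form one contiguous run starting at lower_bound().
std::vector<std::string> termsWithPrefix(
    const std::vector<std::string>& sortedTerms,
    const std::string& prefix)
{
    std::vector<std::string> matches;
    auto it = std::lower_bound(sortedTerms.begin(), sortedTerms.end(), prefix);
    // Walk forward while the term still begins with the prefix.
    for (; it != sortedTerms.end()
           && it->compare(0, prefix.size(), prefix) == 0; ++it) {
        matches.push_back(*it);
    }
    return matches;
}
```

Words ending with XY would need a second list of reversed terms, or a 
full scan.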

> 
>>4. Index-location.  Where do we store the index?  Do we currently have a
>>.bibletime or something to store such things? (I might be able to answer
>>this myself, I haven't looked for it yet.)  
> 
> 
> You can use:
> QString dir( KGlobal::dirs()->saveLocation("data", "bibletime/indices/") );
> 
> On my system, this will return ~/.kde/share/apps/bibletime/indices/, which 
> would be a nice location. ~/.kde/share/apps/bibletime/cache/ is where we 
> currently store the lexicon entry cache files (very simple logic). Indexes 
> also need to be rebuilt should the version of an installed module OR the way 
> we create indexes change. So I guess our module version number and the "index 
> layout" version number need to be stored somewhere. Whenever the index layout 
> changes, we increase the index layout version number, and all indices will be 
> rebuilt for the users.
> We also perhaps need a button to "Delete all index and cache files", if a user 
> has disk space problems.

Ok.
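
The rebuild check you describe could be sketched roughly as follows.  
All names here (IndexStamp, kCurrentLayoutVersion, indexNeedsRebuild) 
are invented for illustration, and where the stamp gets stored is 
left open.

```cpp
#include <string>

// Hypothetical record kept alongside each index; the field names are
// illustrative and not taken from the BibleTime sources.
struct IndexStamp {
    std::string moduleVersion;  // module version at the time of indexing
    int layoutVersion;          // index layout version used to build it
};

// Assumed constant, increased whenever the way we create indexes changes.
const int kCurrentLayoutVersion = 1;

// True if the stored index is stale: either the installed module has a
// different version, or the index layout has changed since it was built.
bool indexNeedsRebuild(const IndexStamp& stored,
                       const std::string& installedModuleVersion)
{
    return stored.moduleVersion != installedModuleVersion
        || stored.layoutVersion != kCurrentLayoutVersion;
}
```

Bumping kCurrentLayoutVersion would then invalidate every existing 
index on the next search.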

> 
>>Also, what about Bibles? 
>>Their indices are not going to change.  Should we distribute index files
>>with the modules?  The user wouldn't have to build at all!
> 
> 
> This is not possible, because Crosswire distributes the module files, and 
> we'll likely use a different index format than other Sword frontends. So I 
> guess we'll have to take care of it.
> 
> How long does it take? How big do they get?

My current test index, built from the KJV with the SimpleAnalyzer, is 
42 MB.  I didn't time it, but it seemed to take 2 to 3 minutes on my 
Athlon 2.13 GHz.

> 
>>5. Analyzers.  It seems that there are many different Analyzers that can
>>be used to build an index.  (Some that differentiate between lower and
>>uppercase, some that take into account grammar rules for certain
>>languages, etc.)  Do we want this flexibility extended to the user?  Or
>>do we just use the simple analyzer which simply breaks up words?
> 
> 
> I don't know, have to read more. Perhaps we should start with the simple one?

Ok. Seems to work fine.

> Lee, I just tagged cvs with rel-1-5-3 to reflect the status of the 1.5.3 
> release which just came out. Feel free to start working in cvs HEAD. Should 
> we need to make more bugfix releases in the meantime, we can create a branch 
> and work there. Once this works well and is documented, we can release 1.6.
> 
Ok.

Thanks,

Lee C.