[sword-devel] A possible way to speed up was Re: Search optimized (still too slow)

Daniel Glassey sword-devel@crosswire.org
Thu, 08 Apr 2004 19:27:22 +0100


Hiya,
I was going to wait until I had thought this through (and had got
somewhere), but since it has been brought up I think I'd better mention
it. Quite a while back David White suggested that separating content
from markup would be a good idea. With the files getting big from using
raw OSIS (or is it pseudo-OSIS? I'm not sure) and the search being so
slow in these modules, I think it is worth doing, aiming for 1.6.0 or
2.0.0 or whatever the next major version is.
 
What I'm suggesting is to make a new module type that contains a binary
representation of OSIS with the text in one file and the markup in a
second file. I think the markup should be based on something like WBXML
(http://www.w3.org/TR/wbxml/) but have pointers into the text rather
than containing the text.
Suggested name: SBXML (Sword Binary XML).
This would mean that the search could be run on just the plain text.
Most filters would only operate on the markup.
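
To make the split a bit more concrete, here is a rough C++ sketch of how
text and markup might be kept apart. Nothing here is existing SWORD code
or a real SBXML format: MarkupSpan, SplitEntry, containsPhrase and
spansCovering are hypothetical names, and the numeric token only stands
in for a WBXML-style tag code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record: one markup element that points into the plain
// text instead of containing it.
struct MarkupSpan {
    std::uint32_t start;   // byte offset of the first covered character
    std::uint32_t length;  // number of bytes covered
    std::uint8_t  token;   // assumed WBXML-style numeric code for a tag
};

// Hypothetical in-memory view of one entry with text and markup apart.
struct SplitEntry {
    std::string text;                // would live in the text file
    std::vector<MarkupSpan> markup;  // would live in the markup file
};

// A search touches only the plain text; no tags are parsed or stripped.
bool containsPhrase(const SplitEntry &entry, const std::string &phrase) {
    return entry.text.find(phrase) != std::string::npos;
}

// A filter-like step: collect the markup spans covering a text offset.
std::vector<MarkupSpan> spansCovering(const SplitEntry &entry,
                                      std::size_t offset) {
    std::vector<MarkupSpan> hits;
    for (const MarkupSpan &s : entry.markup) {
        const std::size_t begin = s.start;
        const std::size_t end = begin + s.length;  // one past the last byte
        if (offset >= begin && offset < end) {
            hits.push_back(s);
        }
    }
    return hits;
}
```

With a layout along these lines the search never has to look at markup,
and a filter can walk the span list without rescanning the text.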

If we think it's a good idea then let's try to design this using the
wiki. I've added a page for it[1].

I think it should be possible to subclass the existing classes for use
by new module drivers and filters so that the current code will continue
to work.

Until it is ready to become core, it would be optionally included via a
configure option.

I don't think I've explained that very well so questions, discussion,
plain opinions and constructive criticism would be very welcome :)

I'm starting from the bottom up, so I'm currently looking at changing
VerseKey (new class VerseKey2) to support multiple versification
systems. I'll explain that once I get far enough to do so. But it's
basically going to be based on the OSIS refsys system[2] and it is going
to lump all the books together rather than separating into testaments. 
Chris, I see now you've already been doing something on the
versification stuff[3], how is that going?
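
As a very rough illustration of what "lumping all the books together"
could look like, here is a C++ sketch of a flat book list for one
versification system. This is not the VerseKey2 design and not existing
SWORD code: BookInfo, Versification and flatIndex are hypothetical
names, and the verse counts in any usage would come from the
versification data, not from this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one book in a versification system.
struct BookInfo {
    std::string osisId;                 // book identifier, e.g. "Gen"
    std::vector<int> versesPerChapter;  // entry i holds chapter i+1
};

// Hypothetical versification: one flat book list, no testament split.
struct Versification {
    std::string name;
    std::vector<BookInfo> books;

    // Returns a zero-based verse index counted from the first verse of
    // the first book, or -1 if the reference does not exist here.
    long flatIndex(const std::string &osisId, int chapter, int verse) const {
        long index = 0;
        for (const BookInfo &b : books) {
            if (b.osisId == osisId) {
                const int chapters = static_cast<int>(b.versesPerChapter.size());
                if (chapter < 1 || chapter > chapters) return -1;
                if (verse < 1 || verse > b.versesPerChapter[chapter - 1]) return -1;
                // Add the verses of the chapters before the requested one.
                for (int c = 0; c < chapter - 1; ++c) {
                    index += b.versesPerChapter[c];
                }
                return index + verse - 1;
            }
            // Not this book: skip over all of its verses.
            for (int n : b.versesPerChapter) {
                index += n;
            }
        }
        return -1;
    }
};
```

Supporting a second versification system would then mean supplying a
second Versification value rather than changing the key logic.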

Regards,
Daniel

[1]http://www.crosswire.org/ucgi-bin/twiki/view/Swordapi/SbXml
[2]http://www.ccel.org/refsys/refsys.html
[3]http://www.crosswire.org/ucgi-bin/twiki/view/Swordapi/AlternateVersification

On Thu, 2004-04-08 at 14:59, Joachim Ansorg wrote:
> Hi,
> I spent some time optimizing the search in CVS.
> One problem is/was, for example, the extensive use of XMLTag in the filters;
> I tried to avoid it in the filters where that was possible without having to
> rewrite them.
> I also used SWBuf::append directly where SWBuf::operator+ was used before.
> 
> I see some good opportunities to optimize:
> 	-Use XMLTag as little as possible
> 	-Change the copy constructor of SWBuf to implicit sharing; we have lots of SWBuf
> copy-constructor calls, I think
> 	-Optimize SWBuf::append(char); maybe we can tweak the memory allocation to
> allocate larger blocks less often. The append(char) function gets called
> more than any other function in a search
> 
> But the best solution would be to parse the text only once and then do the 
> right stuff with it. ATM each filter parses the text again which will make 
> modules with lots of filters slow (e.g. KJV).
> 
> I got these results (with debug code and profiling code included):
> WEB:
> before:	0m8.233s
> after:	0m7.586s
> 	
> KJV:
> before:	1m35.769s
> after:	0m21.874s
> 
> 
> I have not yet committed, because I have to make sure the code doesn't have
> any undetected bugs.
> 
> Joachim
