[sword-devel] Suggestion

Chris Little sword-devel@crosswire.org
Fri, 24 Dec 1999 21:47:55 -0800


I would strongly urge against changing our module format at all.  What we
have is simple to manage and create, portable, and sufficiently fast.
Perhaps I should clarify the different formats we use.  Basically we have 2
different types of indexes.  One type for verse-based modules (Bibles and
commentaries) and one for article-based modules (lexicon/dictionary
modules).

The verse-based modules actually have 3 different indexes (.vss, .cps,
.bks).  The .bks file is a list of locations within the .cps file where the
chapters for a given book are located.  The .cps file is a list of locations
within the .vss file where the verses for a given chapter are located.  And
the .vss file is a list of locations within the actual testament text file
where the text for a given verse is located.  Both the .bks and .cps files
are unused at the moment because we currently use KJV verse numbering only,
so that information is loaded from a static array instead.  So when you ask
for Mk 6:3, first the location of the book of Mark's chapters is looked up,
then the verses of chapter 6, then the location of verse 3 within the 'nt'
text file.  This should all be easily extendable when we move to allowing
non-KJV numbering, apocryphal books, etc., though these will take a slight
hit in performance because they have to read through 3 disk-based indices
instead of only 1.
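
To make the chained lookup concrete, here is a rough Python sketch.  It is
not how the library itself is written; the lists are toy in-memory stand-ins
for the three index files, and the names (bks, cps, vss, lookup), the stored
verse lengths, and the sample text are all assumptions made up for the
example.

```python
# Toy stand-in for a testament text file: six two-character "verses".
text = "A1A2B1C1C2C3"

# vss: (offset, length) of each verse within the text (lengths are assumed).
vss = [(0, 2), (2, 2), (4, 2), (6, 2), (8, 2), (10, 2)]
# cps: for each chapter, the position in vss where its verses start.
cps = [0, 2, 3]
# bks: for each book, the position in cps where its chapters start.
bks = [0, 2]


def lookup(book, chapter, verse):
    """Resolve a 1-based book/chapter/verse reference to its text."""
    first_chapter = bks[book - 1]                    # book -> its chapters
    first_verse = cps[first_chapter + chapter - 1]   # chapter -> its verses
    offset, length = vss[first_verse + verse - 1]    # verse -> text location
    return text[offset:offset + length]


# Book 2, chapter 1, verse 3 walks bks, then cps, then vss.
result = lookup(2, 1, 3)
```

With static KJV numbering, the first two steps would come from arrays in
memory, so only the last step needs to touch the disk.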

The article-based modules use only one index.  It just lists the locations
(and lengths) of all the articles.  When you do a lookup, the SWORD library
does a binary search on the key, which seems sufficiently efficient to me.
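
As an illustration only, a binary search over a sorted article index could
look like the Python sketch below.  The keys, the (offset, length) layout,
and the find_article name are assumptions invented for the example, not the
actual on-disk format or library API.

```python
import bisect

# Toy stand-in for the article data file.
data = "fatherfirst letterso be it"

# Sorted keys, with a parallel list giving each article's (offset, length).
keys = ["ABBA", "ALPHA", "AMEN"]
entries = [(0, 6), (6, 12), (18, 8)]


def find_article(key):
    """Binary-search the sorted keys; return the article text or None."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        offset, length = entries[i]
        return data[offset:offset + length]
    return None


article = find_article("ALPHA")
missing = find_article("ZION")
```

The search only works because the keys are kept sorted; each lookup then
costs a number of comparisons proportional to the logarithm of the number
of articles.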

The formats are entirely portable.  The only issue we had was the byte order
(big- vs. little-endian) used in the indices, but this was fixed for the
Solaris port a couple of months ago.
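
For anyone unfamiliar with the byte-order problem, the Python sketch below
shows the general idea of always writing index entries in one fixed order.
The 4-byte width and the choice of little-endian are assumptions for the
example; I am not stating here which width or order the real indices use.

```python
import struct


def write_offset(value):
    """Encode one index entry as 4 bytes, least significant byte first."""
    return struct.pack("<I", value)


def read_offset(raw):
    """Decode 4 bytes written by write_offset, regardless of host CPU."""
    return struct.unpack("<I", raw)[0]


raw = write_offset(1234)
value = read_offset(raw)

# Interpreting the same bytes in the opposite order yields a different
# number, which is the kind of mismatch a port to another CPU can expose.
swapped = struct.unpack(">I", raw)[0]
```

As long as the reader and writer agree on an explicit order, the same index
file can be used on any machine.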

We have other things that appear like they may be different formats
(hrefcom, rawcom, rawtext, rawld, rawgbf, etc.) but they're really just
slight variations on the two main formats.  hrefcom has a list of
URLs to show for given verses, rawcom is a regular verse-based format for
commentaries, rawtext is a regular verse-based format for Bibles, rawld is a
regular article-based format for lex/dict modules, etc.  (rawgbf is a
deprecated format.  We just use rawtext with sourcetype=GBF in the .conf
file instead now.)

We also support many different text markup formats (plain text, RTF, GBF,
HTML, & soon ThML).  That's a good thing.  No need to change that.  People
can use the formats best suited to their text (GBF is great for Bibles, ThML
is great for commentaries, plain text is fine if there is no markup in the
text source, RTF is good for nothing<g>) or with which they are most
comfortable.

We don't actually index our data content at all.  Indexing is only for
lookups; searching is done linearly through the whole module text.  IMO,
this method is acceptably fast.  Searching every one of our modules except
the losungen (at least 100 modules total) took me 2 minutes (though some of
that might have been due to network lag since I was doing the search over a
modem through the CGI interface).  That's about 100 modules, 598 MB of data
(roughly 5 MB per second), searched on a 450 MHz machine with an ATA-33
drive, which seems like pretty acceptable performance to me.
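
By "linearly" I mean something along the lines of the Python sketch below:
every entry is read and tested in turn, with no content index involved.
The sample entries, the linear_search name, and the case-insensitive
matching are assumptions made up for the example.

```python
# Placeholder module contents: (reference, text) pairs invented for this
# example, not taken from any real module.
module = [
    ("Ref 1:1", "the quick brown fox"),
    ("Ref 1:2", "jumps over the lazy dog"),
    ("Ref 1:3", "The fox sleeps"),
]


def linear_search(entries, term):
    """Return the references whose text contains term, ignoring case."""
    needle = term.lower()
    hits = []
    for ref, text in entries:  # every entry is visited, in order
        if needle in text.lower():
            hits.append(ref)
    return hits


hits = linear_search(module, "fox")
```

The cost grows with the total size of the text searched, which is why the
timing above is the figure that matters.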

The only database-based Bible systems I know of are Bible Pro for Windows
(written in VB and using Bibles in an Access file format) and an IRC
BibleBot on Undernet (which uses SQL and is very fast, but I don't know any
of the specifics).  I guess SQL has a lot of issues that we would have to
deal with, like portability and the availability of free implementations,
though I'm not personally familiar with it.

If we did decide to move to another module format, we could still retain the
old formats, however, because Troy built SWORD with such a nifty modular
architecture.

It's good to hear suggestions flowing and being discussed though.

--Chris