[sword-devel] HowTo: create ztext module?

Tue May 9 09:06:10 MST 2006

> IIRC, Huffman encoding seems to produce an optimal compression. The
> basic idea is to build a trie with the shortest paths through the
> trie being the most frequent patterns. The algorithms that I saw did
> this on input assuming a single byte character encoding such as ASCII
> or Latin-1. It is readily adaptable to UTF-8, by considering bytes
> rather than characters.

I don't think this is typically true. Huffman coding is optimal only 
among codes that give each symbol its own whole-bit codeword. For 
text, LZW-type (dictionary) compression generally gives a better 
compression ratio, though not necessarily better speed.
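
To make the comparison concrete, here is a rough sketch that counts 
the bits each approach would need for the same UTF-8 input. It is only 
an estimate under my own assumptions: the Huffman figure leaves out 
the cost of storing the code table, the LZW figure uses a textbook 
encoder with a growing code width and no clear/stop codes, and the 
function names and sample text are made up for illustration. The 
sample is highly repetitive, which favors LZW, so try it on real text 
before drawing conclusions.

```python
import heapq
from collections import Counter

def huffman_bits(data: bytes) -> int:
    """Bits needed to code `data` with a byte-level Huffman code (table excluded)."""
    freq = Counter(data)
    if len(freq) <= 1:
        return len(data)  # degenerate case: a single symbol, 1 bit each
    # Heap entries: (weight, unique tiebreak, symbols in this subtree)
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    depth = dict.fromkeys(freq, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            depth[s] += 1  # each merge pushes these symbols one level deeper
        heapq.heappush(heap, (w1 + w2, tiebreak, s1 + s2))
        tiebreak += 1
    return sum(freq[s] * depth[s] for s in freq)

def lzw_bits(data: bytes) -> int:
    """Estimated bits for textbook LZW; code width grows with the table."""
    table = {bytes([i]): i for i in range(256)}
    bits = 0
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            bits += len(table).bit_length()  # emit the code for `current`
            table[candidate] = len(table)
            current = bytes([byte])
    if current:
        bits += len(table).bit_length()  # flush the final phrase
    return bits

# Made-up sample; considering bytes rather than characters, as for UTF-8.
sample = ("In the beginning God created the heaven and the earth. " * 200).encode("utf-8")
print("raw bits:    ", 8 * len(sample))
print("Huffman bits:", huffman_bits(sample))
print("LZW bits:    ", lzw_bits(sample))
```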

> I am not aware of any available code to do this. It might exist. But
> it probably would need to be written.
>
> Is it worth the effort? I don't think so at this point in time. My
> take on it is that there is enough to do that this gets pushed
> further down my list of things to do (it is on my todo list). And
> unless it makes sense in the SWORD world as a contribution, it would
> only be an academic exercise for me (which I love doing).
>
> I think that in the LCDBible world, it would make lots of sense.

A year or so ago, I set up a SourceForge project, BibleDB, that would 
be optimized for Bible decompression/decryption/search speed (not 
necessarily for compression ratio). The idea was to encode each word 
with a variable number of bits (6, 10, or 14), chosen from an analysis 
of word frequency. All tags would be stored externally as 
lengths/offsets, and not in the actual content, in order to optimize 
searching.
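
For illustration only, here is one way a 6/10/14-bit word code could 
be laid out. How the three widths are told apart is not something I 
worked out above, so the prefix layout, the TIERS table and the 
function names below are all assumptions. Under this particular layout 
only 32 + 256 + 4096 words fit, so a full Bible vocabulary would need 
a different split or an escape mechanism.

```python
from collections import Counter

# Hypothetical layout: a short prefix announces the code width.
#   "0"  + 5 payload bits  ->  6-bit code, frequency ranks 0..31
#   "10" + 8 payload bits  -> 10-bit code, ranks 32..287
#   "11" + 12 payload bits -> 14-bit code, ranks 288..4383
TIERS = [("0", 5), ("10", 8), ("11", 12)]

def encode_rank(rank):
    """Return the bit string for a word's frequency rank (0 = most frequent)."""
    base = 0
    for prefix, payload in TIERS:
        size = 1 << payload
        if rank < base + size:
            return prefix + format(rank - base, "0{}b".format(payload))
        base += size
    raise ValueError("rank does not fit in this illustrative layout")

def decode_ranks(bits):
    """Split a concatenated bit string back into frequency ranks."""
    ranks, pos = [], 0
    while pos < len(bits):
        base = 0
        for prefix, payload in TIERS:
            if bits.startswith(prefix, pos):
                start = pos + len(prefix)
                ranks.append(base + int(bits[start:start + payload], 2))
                pos = start + payload
                break
            base += 1 << payload
        else:
            raise ValueError("malformed bit string")
    return ranks

# The dictionary is just the word list ordered by descending frequency.
text = "in the beginning god created the heaven and the earth".split()
dictionary = [word for word, _ in Counter(text).most_common()]
rank_of = {word: i for i, word in enumerate(dictionary)}

bits = "".join(encode_rank(rank_of[word]) for word in text)
assert [dictionary[r] for r in decode_ranks(bits)] == text  # round trip
print(len(bits), "bits for", len(text), "words")
```

Because every code has a fixed width once its prefix is read, a 
decoder can step through the stream without a trie walk, which is the 
speed-over-ratio trade I had in mind.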

As a group, all English Bibles have a fairly small number of words 
(about 16,000 ... give or take a thousand or so, depending on how you 
count capitalization, plurals, possessives, contractions, etc.), and 
the dictionary is very static. The ESV and WEB would have almost the 
exact same dictionary ... the KJV-1769/ASV would be only slightly 
different. A single dictionary would suffice for all English 
translations (or maybe separate dictionaries for the OT and NT?).
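
The "depending on how you count" part is easy to see on a toy 
example. The sketch below is not a real tokenizer: the regular 
expression, the option names and the sample sentence are my own 
assumptions, and it only handles ASCII letters and apostrophes.

```python
import re

def vocabulary(text, fold_case=False, strip_possessive=False):
    """Return the set of distinct words under the chosen counting rules."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if fold_case:
        words = [w.lower() for w in words]
    if strip_possessive:
        words = [re.sub(r"'s$", "", w) for w in words]
    return set(words)

# Made-up sample sentence, not taken from any translation.
sample = "The LORD is my shepherd. The shepherd's staff; the Lord's house."

print(len(vocabulary(sample)))                                         # 10
print(len(vocabulary(sample, fold_case=True)))                         # 9
print(len(vocabulary(sample, fold_case=True, strip_possessive=True)))  # 7
```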

One intent was to have search integrated into this (sort of like how 
Lucene works?), and dictionaries/concordances would be feasible.
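
What I mean by integrated search is roughly an inverted index: a map 
from each word to the verses that contain it. The sketch below is the 
generic idea only, not Lucene's implementation and not anything that 
exists in BibleDB; the function names and the two-verse sample are 
assumptions for illustration.

```python
from collections import defaultdict

def build_index(verses):
    """Map each lowercased word to the set of verse references containing it."""
    index = defaultdict(set)
    for ref, text in verses.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?")].add(ref)
    return index

def search(index, query):
    """Return the references that contain every word of the query."""
    sets = [index.get(word.lower(), set()) for word in query.split()]
    return sorted(set.intersection(*sets)) if sets else []

verses = {
    "Gen 1:1": "In the beginning God created the heaven and the earth.",
    "John 1:1": "In the beginning was the Word, and the Word was with God, "
                "and the Word was God.",
}
index = build_index(verses)

print(search(index, "beginning God"))  # ['Gen 1:1', 'John 1:1']
print(search(index, "created"))        # ['Gen 1:1']
```

The same map, printed word by word with its references, is 
essentially a concordance.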

After some wrestling with it, I realized I don't have the time, math 
background, or aptitude to have much of a chance of making it work. 
BibleDB is still only at the pre-alpha stage.
http://sourceforge.net/project/admin/?group_id=117234