[sword-devel] HowTo: create ztext module?

DM Smith dsmith555 at yahoo.com
Tue May 9 08:10:14 MST 2006


L.Allan-pbio wrote:
>> While not simpler, there are ways to compress the files that don't
>> use stream compression such that each verse can be handled
>> independently.
>
> I'd be interested in how to do this, and still get decent compression. 
The basic, greatly simplified technique is to build a mapping from the 
characters seen to the bit patterns output. A streaming (de)compressor 
starts with a default dictionary and adjusts it as characters are seen. 
Rather than analyzing just the next character, the analysis is done over 
a "window" of input; IIRC, a larger window gives better compression. 
Since the dictionary is based on everything that came before, it is not 
possible to synchronize in the middle of the stream.
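
As a small illustration of that last point, using Python's zlib as a 
stand-in for stream compression (the verse text is just a sample): a 
slice taken from the middle of a compressed stream cannot be decoded on 
its own, because both the stream header and the dictionary state built 
from earlier input are missing.

```python
import zlib

verses = b"In the beginning God created the heaven and the earth. " * 10
stream = zlib.compress(verses)

# Decoding from the start of the stream works:
assert zlib.decompress(stream) == verses

# Decoding from the middle does not: the header and the dictionary
# state accumulated from the earlier input are both gone.
try:
    zlib.decompress(stream[20:])
    print("resynchronized (unexpected)")
except zlib.error:
    print("cannot resynchronize mid-stream")
```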

This kind of compression is not optimal. If it were known in advance 
what was being compressed (i.e., two-pass or static knowledge), then a 
more appropriate dictionary could be built.

It has been a number of years since I have examined compression 
techniques, so there may be some recent advances...

IIRC, Huffman encoding produces an optimal prefix code for a given 
symbol frequency distribution. The basic idea is to build a trie in 
which the most frequent symbols have the shortest paths. The algorithms 
that I saw did this on input assuming a single-byte character encoding 
such as ASCII or Latin-1, but it is readily adaptable to UTF-8 by 
considering bytes rather than characters.
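
A minimal sketch of that construction in Python (the names are mine; a 
real implementation would also pack the resulting codes into a bit 
stream):

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict[int, str]:
    """Build a Huffman code (byte -> bit string) from byte frequencies."""
    freq = Counter(data)
    # Heap entries are (weight, tiebreak, tree); a tree is either a byte
    # value or a (left, right) pair. The tiebreak keeps comparisons sane.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    n = len(heap)
    if n == 1:  # degenerate case: a single distinct byte gets code "0"
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, n, (t1, t2)))
        n += 1
    codes: dict[int, str] = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")   # shorter paths for heavier nodes
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2])
    return codes

codes = huffman_codes(b"in the beginning god created the heaven and the earth")
# Frequent bytes (space here) get codes no longer than rarer ones ('g'):
assert len(codes[ord(" ")]) <= len(codes[ord("g")])
```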

Further compression can be achieved by analyzing "word" frequency and 
giving frequent words their own dictionary entries. E.g. the word "the" 
probably occurs more frequently than the letter "Z". If encoding "t", 
"h", "e" separately costs more (the sum of the code lengths, plus the 
time spent processing them) than encoding "the" as a single symbol, 
then it makes sense to put "the" in the dictionary. (Greatly 
simplified! The algorithms I saw defined a word as a bounded-length 
sequence of letters.)
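
A back-of-the-envelope check of that claim (the sample text and the 
per-symbol bit costs are made-up illustrations, not measured values):

```python
import re
from collections import Counter

text = ("In the beginning God created the heaven and the earth. "
        "And the earth was without form, and void.").lower()

word_freq = Counter(re.findall(r"[a-z]+", text))
char_freq = Counter(text)

# "the" really is more common here than the letter "z":
assert word_freq["the"] > char_freq["z"]

# Rough cost comparison, assuming ~4 bits per letter code and
# ~6 bits for a dedicated "the" word code:
spelled_out = word_freq["the"] * 3 * 4   # three letter codes each time
as_one_word = word_freq["the"] * 6       # one word code each time
assert as_one_word < spelled_out
```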

Another gain can come from using a smaller, substitute representation 
for the markup language. E.g. <hi type="italic">...</hi> could become 
<e>...</e>, where the substitution chosen is highly compressible. 
(Google: xml compression)
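
A sketch of the idea (the one-byte placeholders and the OSIS-like tag 
are examples of mine, not an established mapping):

```python
# Hypothetical one-byte substitutes for verbose markup:
TAGS = {'<hi type="italic">': "\x01", "</hi>": "\x02"}

def shrink_markup(text: str) -> str:
    """Replace verbose tags with short placeholders before compressing."""
    for tag, short in TAGS.items():
        text = text.replace(tag, short)
    return text

def expand_markup(text: str) -> str:
    """Reverse the substitution after decompressing."""
    for tag, short in TAGS.items():
        text = text.replace(short, tag)
    return text

verse = ('And God said, <hi type="italic">Let there be light</hi>: '
         'and there was light.')
small = shrink_markup(verse)
assert expand_markup(small) == verse   # the substitution is lossless
assert len(small) < len(verse)         # fewer bytes before compression
```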

Huffman codes tend to self-synchronize: one can start almost anywhere 
in the bit stream, skip bits until a run of them resolves to entries in 
the dictionary, and from that point forward the stream can be 
understood. (Strictly speaking this is a statistical tendency rather 
than a guarantee, but in practice resynchronization happens quickly.)

Building the dictionary would take a single pass over the input plus a 
statistical analysis of it. The dictionary would be written as part of 
the module, probably as a separate file.
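
Incidentally, zlib already supports a preset dictionary supplied out of 
band, which is roughly this idea. A sketch in Python (the dictionary 
bytes here are an arbitrary sample, not the result of a real corpus 
analysis):

```python
import zlib

# A dictionary built from one pass over the corpus, shipped as its own
# file alongside the module:
zdict = b"And God said the beginning heaven earth light"

verse = b"And God said, Let there be light: and there was light."
c = zlib.compressobj(zdict=zdict)
packed = c.compress(verse) + c.flush()

# The decompressor must be given the same dictionary:
d = zlib.decompressobj(zdict=zdict)
assert d.decompress(packed) == verse
```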

Compression of the module would probably best be based on atomic 
units, say verses, since decompression will be done on those units.
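
Compressing per verse might look like this sketch (the verse keys and 
the use of zlib as the compressor are my choices for illustration):

```python
import zlib

verses = {
    "Gen.1.1": "In the beginning God created the heaven and the earth.",
    "Gen.1.2": "And the earth was without form, and void.",
    "Gen.1.3": "And God said, Let there be light: and there was light.",
}

# Each atomic unit is compressed on its own, so any single verse can be
# read back without decompressing anything else in the module:
blocks = {ref: zlib.compress(text.encode("utf-8"))
          for ref, text in verses.items()}

assert zlib.decompress(blocks["Gen.1.3"]).decode("utf-8") == verses["Gen.1.3"]
```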

I am not aware of any available code to do this. It might exist. But it 
probably would need to be written.

Is it worth the effort? I don't think so at this point in time. My take 
on it is that there is enough to do that this gets pushed further down 
my list of things to do (it is on my todo list). And unless it makes 
sense in the SWORD world as a contribution, it would only be an academic 
exercise for me (which I love doing).

I think that in the LCDBible world, it would make lots of sense.

More information about the sword-devel mailing list