[sword-devel] task

Chris Little sword-devel@crosswire.org
Sun, 9 Sep 2001 16:06:42 -0700

> > > Implement SCSU (de)compression drivers--
> > > SCSU is the Standard Compression Scheme for Unicode 
> > > (http://www.unicode.org/unicode/reports/tr6/), which compresses 
> > > Unicode streams by using the fact that most characters in a string
> > > come from the same code pages and therefore repeat a lot of
> > > information.  Basically, if you use SCSU and then ZIP the result,
> > > you'll get something smaller than either of the compression schemes
> > > alone would produce.  I'll have SCSU (along with UTF-8/16/32) code
> > > from ICU in CVS sometime pretty soon, but it'll still need to be 
> > > worked into the library.

> The zVerse class' c-tor also takes an SWCompress * to do the 
> compression work, so, in my opinion, we really need an 
> SWCompress * subclass-- SCSUCompress : public SWCompress that 
> understands the SCSU compression scheme.

*sigh*  Troy is just completely wrong here and I think we need to take
away his CVS write access for a week or so as punishment. ;)
Well, I guess we don't really need to blame him since it's not
completely obvious what SCSU really is.

SCSU shouldn't really be thought of as a compression algorithm, but as a
character encoding.  For any Unicode character, there is exactly one way
to express it in UTF-8, 16, or 32.  SCSU is the same kind of thing: any
string of Unicode characters can be expressed in SCSU and decoded back
unambiguously, although an SCSU encoder has some freedom in which byte
sequence it emits.  SCSU is very good at expressing Unicode strings that
use many characters from the same codepage (like Greek, Hebrew, Cyrillic,
Armenian, etc.) in a smaller space than the UTFs.
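
As a rough illustration of where the savings come from, here is a minimal
sketch of the easiest SCSU case: text that is all ASCII plus characters from
one 128-character window sitting at one of SCSU's default window positions.
It is not the ICU code, and the names scsuEncodeOneWindow, utf8Length and
utf16Length are made up for this example.  Scripts such as Greek, which have
no default window, need a define-window tag that this sketch leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Encode text that is ASCII plus characters from ONE default dynamic window.
// Returns an empty vector if the text does not fit this restricted case.
std::vector<std::uint8_t> scsuEncodeOneWindow(const std::u32string& text, int window) {
    // Default start offsets of SCSU's eight dynamic windows.
    static const std::uint32_t kDefaultOffsets[8] = {
        0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00};
    assert(window >= 0 && window < 8);
    const std::uint32_t base = kDefaultOffsets[window];

    std::vector<std::uint8_t> out;
    if (window != 0) {
        // SCn tag (0x10 + n) selects window n; window 0 is active at start.
        out.push_back(static_cast<std::uint8_t>(0x10 + window));
    }
    for (char32_t c : text) {
        if (c >= 0x20 && c <= 0x7E) {
            out.push_back(static_cast<std::uint8_t>(c));  // ASCII passes through
        } else if (c >= base && c < base + 0x80) {
            // One byte per character: 0x80 plus the position in the window.
            out.push_back(static_cast<std::uint8_t>(0x80 + (c - base)));
        } else {
            return {};  // would need tags this sketch does not implement
        }
    }
    return out;
}

// Number of bytes the same text occupies in UTF-8.
std::size_t utf8Length(const std::u32string& text) {
    std::size_t n = 0;
    for (char32_t c : text) n += c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
    return n;
}

// Number of bytes the same text occupies in UTF-16.
std::size_t utf16Length(const std::u32string& text) {
    std::size_t n = 0;
    for (char32_t c : text) n += c < 0x10000 ? 2 : 4;
    return n;
}
```

For the six-letter Cyrillic word U+041C U+043E U+0441 U+043A U+0432 U+0430
with window 2 (default offset 0x0400), this produces 7 bytes
(12 9C BE C1 BA B2 B0), against 12 bytes in either UTF-8 or UTF-16.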

In addition, SCSU works very well on small strings (unlike ZIP or LZSS)
and you would only save one or two bytes by grouping verses as we do
with ZIP & LZSS.  So for that reason, I would actually recommend against
subclassing SWCompress to do the SCSU driver, and would suggest instead
subclassing SWFilter and doing a SCSU to UTF-8 or UTF-16 filter (using
the ICU macros that I still haven't committed).
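
A filter along those lines might look roughly like the sketch below.  It is
only an illustration under stated assumptions: TextFilter is a hypothetical
stand-in, not SWORD's real SWFilter interface, and the decoder is hand-written
rather than built on the ICU macros.  It handles only SCSU's single-byte mode
with the SCn (select window) and SDn (define window) tags, and rejects
everything else.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical filter base class, used here in place of SWORD's SWFilter.
class TextFilter {
public:
    virtual ~TextFilter() = default;
    virtual void process(std::string& text) = 0;  // rewrites text in place
};

// Decodes a subset of SCSU (single-byte mode, SCn and SDn tags) to UTF-8.
class ScsuToUtf8Filter : public TextFilter {
public:
    void process(std::string& text) override {
        // Each string starts from the default window offsets, window 0 active.
        std::uint32_t offsets[8] = {0x0080, 0x00C0, 0x0400, 0x0600,
                                    0x0900, 0x3040, 0x30A0, 0xFF00};
        int active = 0;
        std::string out;
        for (std::size_t i = 0; i < text.size(); ++i) {
            const std::uint8_t b = static_cast<std::uint8_t>(text[i]);
            if (b >= 0x80) {
                // Character from the active window.
                appendUtf8(out, offsets[active] + (b - 0x80u));
            } else if (b >= 0x20 || b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0D) {
                out.push_back(static_cast<char>(b));  // ASCII passes through
            } else if (b >= 0x10 && b <= 0x17) {
                active = b - 0x10;  // SCn: select window n
            } else if (b >= 0x18 && b <= 0x1F) {
                // SDn: define window n from the next byte, then select it.
                if (++i >= text.size()) throw std::runtime_error("truncated SDn tag");
                const std::uint8_t x = static_cast<std::uint8_t>(text[i]);
                if (x < 0x01 || x > 0x67) {
                    throw std::runtime_error("window offset not handled by this sketch");
                }
                active = b - 0x18;
                offsets[active] = x * 0x80u;
            } else {
                throw std::runtime_error("SCSU tag not handled by this sketch");
            }
        }
        text.swap(out);
    }

private:
    static void appendUtf8(std::string& out, std::uint32_t cp) {
        assert(cp < 0x10000);  // the windows handled above stay in the BMP
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
};
```

Because the decoder state is reset for every call, each verse can be stored
and decoded on its own, which is what makes the per-string filter approach
workable without grouping verses into blocks.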

I know we were sort of targeting doing all modules in UTF-8, but I
think there's a case to be made for using other Unicode encodings, at
least when they save space.