Delirium wrote:
Out of curiosity, have you tried testing bzip2? It's usually much
better than gzip with multi-megabyte text data; for example, the source
for Linux kernel 2.6.9 is ~44 MB with gzip and ~35 MB with bzip2. I
believe it also exploits similarities across files, so concatenation may
not be necessary. It does use much more RAM and execute more slowly
than gzip, however.
Yes, see http://meta.wikimedia.org/wiki/History_compression . Bzip2 had
a much better compression ratio, but it was 3.3 times slower to
decompress and 13 times slower to compress. No block size could give it
anything like the performance of gzip.
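For readers who want to reproduce this kind of comparison, here is a minimal sketch (not the benchmark cited above) using Python's zlib and bz2 modules as stand-ins for the gzip and bzip2 command-line tools, on synthetic revision text:

```python
import bz2
import time
import zlib

# Synthetic "article history": one base text lightly edited many times.
# The data and numbers are illustrative only, not the Wikipedia test set.
base = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 200
history = "".join(base.replace("Lorem", f"Rev{i}") for i in range(50)).encode()

for name, compress in (("zlib", zlib.compress), ("bz2", bz2.compress)):
    start = time.perf_counter()
    packed = compress(history)
    elapsed = time.perf_counter() - start
    ratio = 100 * (1 - len(packed) / len(history))
    print(f"{name}: {ratio:.1f}% compression in {elapsed * 1000:.1f} ms")
```

On highly redundant input like this, both compressors do well; the interesting differences are in CPU time, which is what the figures above measure.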
Concatenation is still necessary. In the previous test, bzip2 gave 97%
compression for heavily edited articles, which far exceeds anything
recorded for individual revisions.
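The effect of concatenation can be sketched as follows: consecutive revisions are nearly identical, so compressing the concatenated history lets the compressor exploit cross-revision redundancy that per-revision compression cannot see. The data here is synthetic, not the Wikipedia test:

```python
import bz2

# 30 revisions of one article, each a small edit on a large shared body.
base = "The quick brown fox jumps over the lazy dog. " * 500
revisions = [base + f"Edit number {i}." for i in range(30)]

# Compress each revision on its own, then the whole history at once.
individually = sum(len(bz2.compress(r.encode())) for r in revisions)
concatenated = len(bz2.compress("".join(revisions).encode()))

print(f"compressed separately:   {individually} bytes")
print(f"compressed concatenated: {concatenated} bytes")
```

The concatenated form comes out far smaller, because the shared text is stored (in effect) once rather than thirty times.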
Preliminary testing of a diff-based method suggests that diffs can
achieve a compression ratio similar to bzip2's while being even faster
than gzip.
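As a hypothetical sketch of what a diff-based history store could look like (using Python's difflib as a stand-in for whatever diff engine production code would use): keep the first revision whole, store each later revision as a delta against its predecessor, and replay deltas to reconstruct any revision.

```python
import difflib

def to_deltas(revisions):
    """Keep revision 0 whole; store each later revision as an ndiff
    delta against its predecessor."""
    deltas = [revisions[0]]
    for prev, curr in zip(revisions, revisions[1:]):
        deltas.append(list(difflib.ndiff(
            prev.splitlines(keepends=True),
            curr.splitlines(keepends=True))))
    return deltas

def restore(deltas, index):
    """Rebuild revision `index` by replaying deltas from the base text.
    difflib.restore(delta, 2) recovers the "after" side of an ndiff."""
    text = deltas[0]
    for delta in deltas[1:index + 1]:
        text = "".join(difflib.restore(delta, 2))
    return text
```

Since each delta only records the changed lines, the store stays small for heavily edited articles, and applying a handful of line-level diffs is much cheaper than a bzip2 decompression pass.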
We've generally assumed that performance is the most important
consideration. I'm willing to admit the possibility that extra DB
hardware for poorly compressed data may turn out to be more expensive
than extra Apache hardware for better compression. But a diff-based
method may allow us to avoid that tradeoff altogether.
-- Tim Starling