Tim Starling wrote:
I just downloaded the old table for ca.wikipedia.org, and compressed
it with my new concatenated gzip history compression feature. The idea
is that by concatenating the text from adjacent revisions before
compressing with gzip, the gzip algorithm can take advantage of
similarities between revisions and thereby achieve a better
compression ratio than it would by compressing individual revisions.
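
A minimal sketch of the effect in Python (the revision strings here are
illustrative stand-ins, not text from the actual ca.wikipedia.org dump):

    import gzip

    # Hypothetical adjacent revisions; real data would come from the
    # old table. Note how the second is a near-copy of the first.
    revisions = [
        "First revision of the article text." * 50,
        "First revision of the article text." * 50 + " A small edit.",
    ]

    # Compress each revision separately.
    individual = sum(len(gzip.compress(r.encode("utf-8"))) for r in revisions)

    # Concatenate adjacent revisions, then compress once; gzip's 32 KB
    # sliding window lets it reuse matches from the previous revision.
    concatenated = len(gzip.compress("".join(revisions).encode("utf-8")))

    print(individual, concatenated)  # concatenated is typically much smaller
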
Out of curiosity, have you tried bzip2? It's usually much better than
gzip on multi-megabyte text data; for example, the source for Linux
kernel 2.6.9 is ~44 MB with gzip and ~35 MB with bzip2. I believe it
also exploits similarities across files, so concatenation may not be
necessary. It does use much more RAM and run more slowly than gzip,
however.
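
A rough way to compare the two on the same data (a sketch; the repeated
string is a stand-in for real, highly redundant revision text):

    import bz2
    import gzip

    # Stand-in for multi-megabyte repetitive history text.
    data = ("The quick brown fox jumps over the lazy dog.\n" * 200000).encode("utf-8")

    # bzip2 compresses independent blocks of up to 900 KB, so repetition
    # within a block is exploited, much as gzip exploits its 32 KB window.
    print("gzip: ", len(gzip.compress(data)))
    print("bzip2:", len(bz2.compress(data, 9)))
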
-Mark