I just downloaded the old table for
ca.wikipedia.org, and compressed it
with my new concatenated gzip history compression feature. The idea is
that by concatenating the text from adjacent revisions before
compressing with gzip, the gzip algorithm can take advantage of
similarities between revisions and thereby achieve a better compression
ratio than it would by compressing individual revisions.
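Roughly, the idea looks like this (a Python sketch for illustration only,
not the actual MediaWiki code; the revision strings are made up):

  import zlib

  def compress_separately(revisions):
      # baseline: deflate each revision on its own
      return sum(len(zlib.compress(r.encode("utf-8"))) for r in revisions)

  def compress_concatenated(revisions):
      # concatenate adjacent revisions and compress the whole block once,
      # so the compressor can exploit the overlap between revisions
      blob = "\0".join(revisions).encode("utf-8")  # separator is arbitrary here
      return len(zlib.compress(blob))

  revisions = ["First draft of the article. " * 50,
               "First draft of the article. " * 50 + "One small edit."]
  print(compress_separately(revisions), compress_concatenated(revisions))

Since consecutive revisions usually differ by only a few lines, the
concatenated blob compresses far better than the per-revision total.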
The old table in question had 38,697 rows. 36,536 were already
compressed with gzip, 2,159 were uncompressed and 2 had an invalid
old_flags field. The SQL dump was 46.3 MB.
I compressed it with a maximum chunk size of 10. I wrote an algorithm to
change the chunk size depending on compressibility, but it was disabled
for this test for performance reasons. With these parameters, the SQL
dump after compression was 22.1 MB, 0.48 times the size of the original.
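In other words, adjacent revisions were grouped into blobs of at most 10
and each blob compressed as a unit, along these lines (again just a
sketch; the adaptive chunk-size logic is not shown):

  import zlib

  CHUNK_SIZE = 10  # maximum revisions per blob, as used in this test

  def compress_in_chunks(revisions, chunk_size=CHUNK_SIZE):
      # group adjacent revisions into chunks of at most chunk_size and
      # compress each chunk as one concatenated blob
      blobs = []
      for i in range(0, len(revisions), chunk_size):
          chunk = revisions[i:i + chunk_size]
          blobs.append(zlib.compress("\0".join(chunk).encode("utf-8")))
      return blobs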
If you take out the headers (edit comments, attribution, etc.) and SQL
detritus, the decompressed text is 99.0 MB and the compressed text is
17.6 MB, so the compressed text is about 18% of the original size
(17.6 / 99.0), a space saving of 82%.
CRC32 checksums of the decompressed text were recorded before and after
the test to check data integrity. There were no errors.
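The check amounts to something like the following (a sketch with
placeholder data, not the actual test harness):

  import zlib

  def crc_of(texts):
      # CRC32 over the concatenation of all revision texts
      crc = 0
      for t in texts:
          crc = zlib.crc32(t.encode("utf-8"), crc)
      return crc & 0xffffffff

  original = ["revision one", "revision one, slightly edited"]
  blob = zlib.compress("\0".join(original).encode("utf-8"))
  recovered = zlib.decompress(blob).decode("utf-8").split("\0")
  assert crc_of(original) == crc_of(recovered), "corrupted in round trip"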
Hopefully we will be able to improve the compression ratio still
further, but after this test I would consider the feature to be good
enough for a beta release. The obvious direction for future development
is to try some sort of diff algorithm.
-- Tim Starling