New subject: First attempt at concatenation-based history compression gives 82% compression

30 Oct 2004

I just downloaded the old table for ca.wikipedia.org, and compressed it 
with my new concatenated gzip history compression feature. The idea is 
that by concatenating the text from adjacent revisions before 
compressing with gzip, the gzip algorithm can take advantage of 
similarities between revisions and thereby acheive a better compression 
ratio than it would by compressing individual revisions.

The old table in question had 38,697 rows. 36,536 were already 
compressed with gzip, 2,159 were uncompressed and 2 had an invalid 
old_flags field. The SQL dump was 46.3 MB.

I compressed it with a maximum chunk size of 10. I wrote an algorithm to 
change the chunk size depending on compressibility, but it was disabled 
for this test for performance reasons. With these parameters, the SQL 
dump after compression was 22.1 MB, 0.48 times the size of the original.

If you take out the headers (edit comments, attribution, etc.) and SQL 
detritus, the decompressed text is 99.0 MB and the compressed text is 
17.6 MB, so that's a compression ratio of 82%.

CRC32 checksums of the decompressed text were recorded before and after 
the test to check data integrity. There were no errors.

Hopefully we will be able to improve the compression ratio still 
further, but after this test I would consider the feature to be good 
enough for a beta release. The obvious direction for future development 
  is to try some sort of diff algorithm.

-- Tim Starling