I just downloaded the old table for ca.wikipedia.org, and compressed it with my new concatenated gzip history compression feature. The idea is that by concatenating the text from adjacent revisions before compressing with gzip, the gzip algorithm can take advantage of similarities between revisions and thereby achieve a better compression ratio than it would by compressing individual revisions.
The old table in question had 38,697 rows. 36,536 were already compressed with gzip, 2,159 were uncompressed and 2 had an invalid old_flags field. The SQL dump was 46.3 MB.
I compressed it with a maximum chunk size of 10. I wrote an algorithm to change the chunk size depending on compressibility, but it was disabled for this test for performance reasons. With these parameters, the SQL dump after compression was 22.1 MB, 0.48 times the size of the original.
If you take out the headers (edit comments, attribution, etc.) and SQL detritus, the decompressed text is 99.0 MB and the compressed text is 17.6 MB, so that's a compression ratio of 82% (i.e. the compressed text is about 18% of the size of the decompressed text).
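If anyone wants to play with the idea, here is a toy sketch of the principle in Python. It is not the actual MediaWiki code: the sample revisions, the NUL separator and the fixed chunk size are just placeholders, and the real feature has to record where each revision starts so that a single revision can be pulled back out.

import gzip

def compress_individually(revisions):
    # Each revision compressed on its own: the text shared between
    # adjacent revisions is compressed over and over again.
    return sum(len(gzip.compress(r.encode('utf-8'))) for r in revisions)

def compress_concatenated(revisions, chunk_size=10):
    # Adjacent revisions joined into chunks of up to chunk_size before
    # compression, so the compressor can exploit their similarity.
    total = 0
    for i in range(0, len(revisions), chunk_size):
        chunk = '\x00'.join(revisions[i:i + chunk_size])
        total += len(gzip.compress(chunk.encode('utf-8')))
    return total

# Toy data: successive revisions of one article differ only by a small edit.
article = ' '.join('paragraph %d of some wiki article text.' % i for i in range(200))
revisions = [article + ' edit number %d' % n for n in range(50)]
print('individual:  ', compress_individually(revisions))
print('concatenated:', compress_concatenated(revisions))

On data like this, where adjacent revisions are nearly identical, the concatenated chunks should come out well under the per-revision total.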
CRC32 checksums of the decompressed text were recorded before and after the test to check data integrity. There were no errors.
Hopefully we will be able to improve the compression ratio still further, but after this test I would consider the feature to be good enough for a beta release. The obvious direction for future development is to try some sort of diff algorithm.
-- Tim Starling
Tim Starling wrote:
I just downloaded the old table for ca.wikipedia.org, and compressed it with my new concatenated gzip history compression feature. The idea is that by concatenating the text from adjacent revisions before compressing with gzip, the gzip algorithm can take advantage of similarities between revisions and thereby achieve a better compression ratio than it would by compressing individual revisions.
Out of curiosity, have you tried testing bzip2? It's usually much better than gzip with multi-megabyte text data; for example, source for Linux kernel 2.6.9 is ~44 MB with gzip, and ~35 MB with bzip2. I believe it also uses similarities across files, so concatenation may not be necessary. It does use much more RAM and execute more slowly than gzip, however.
-Mark
Delirium wrote:
Out of curiosity, have you tried testing bzip2? It's usually much better than gzip with multi-megabyte text data; for example, source for Linux kernel 2.6.9 is ~44 MB with gzip, and ~35 MB with bzip2. I believe it also uses similarities across files, so concatenation may not be necessary. It does use much more RAM and execute more slowly than gzip, however.
Yes, see http://meta.wikimedia.org/wiki/History_compression . Bzip2 had a much better compression ratio, but it was 3.3 times slower to decompress and 13 times slower to compress. No block size could give it anything like the performance of gzip.
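The shape of that comparison is easy to reproduce with Python's gzip and bz2 modules if anyone is curious. The blob below is made-up revision text, so the numbers won't match the figures from the real test; the ratio/speed tradeoff is the point.

import bz2, gzip, time

article = ' '.join('paragraph %d of some wiki article text.' % i for i in range(500))
blob = '\x00'.join(article + ' edit number %d' % n for n in range(100)).encode('utf-8')

for name, compress, decompress in (('gzip',  gzip.compress, gzip.decompress),
                                   ('bzip2', bz2.compress,  bz2.decompress)):
    t0 = time.perf_counter()
    packed = compress(blob)
    t1 = time.perf_counter()
    assert decompress(packed) == blob   # round-trip integrity check
    t2 = time.perf_counter()
    print('%-5s  space saving %4.1f%%  compress %.3fs  decompress %.3fs'
          % (name, 100.0 * (1 - len(packed) / float(len(blob))), t1 - t0, t2 - t1))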
Concatenation is still necessary. In the previous test, bzip2 gave 97% compression for heavily edited articles, which far exceeds anything recorded for individual revisions.
Preliminary testing of a diff method suggests that diffs can give a similar compression ratio to bzip2, but with a speed even faster than gzip.
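To give a concrete idea of the sort of thing I mean (this is not the algorithm under test, just an illustration of the general approach using Python's difflib): store one revision in a chunk in full, and reduce each later revision to copy ranges against its predecessor plus the newly inserted text.

import difflib

def make_diff(old, new):
    # Reduce `new` to copy ranges against `old` plus the inserted text.
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
        if tag == 'equal':
            ops.append(('copy', i1, i2))
        else:
            ops.append(('insert', new[j1:j2]))   # empty string for a pure deletion
    return ops

def apply_diff(old, ops):
    # Rebuild the later revision from the earlier one plus the stored ops.
    parts = []
    for op in ops:
        parts.append(old[op[1]:op[2]] if op[0] == 'copy' else op[1])
    return ''.join(parts)

rev1 = 'The quick brown fox jumps over the lazy dog. ' * 40
rev2 = rev1.replace('jumps', 'leaps') + 'A sentence added in this edit.'
ops = make_diff(rev1, rev2)
assert apply_diff(rev1, ops) == rev2

Only the inserted text and a handful of integers per hunk need storing, and applying the ops is plain string copying, which is why a scheme along these lines can plausibly beat gzip on speed.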
We've generally assumed that performance is the most important thing. I'm willing to admit the possibility that extra DB hardware for poorly compressed data may turn out to be more expensive than extra Apache hardware for better compression. But a diff method may allow us to avoid that tradeoff altogether.
-- Tim Starling
Delirium wrote:
Out of curiosity, have you tried testing bzip2? It's usually much better than gzip with multi-megabyte text data; for example, source for Linux kernel 2.6.9 is ~44 MB with gzip, and ~35 MB with bzip2. I believe it also uses similarities across files, so concatenation may not be necessary. It does use much more RAM and execute more slowly than gzip, however.
I replied:
Concatenation is still necessary. In the previous test, bzip2 gave 97% compression for heavily edited articles, which far exceeds anything recorded for individual revisions.
Sorry, I misunderstood what you meant. Article text is stored in the database, not in files. We can't compress articles with "bzip2 /database/revisions/*.txt". Any compression of multiple revisions in the same instance of bzip2 has to be managed by MediaWiki.
-- Tim Starling
Tim Starling wrote:
Sorry, I misunderstood what you meant. Article text is stored in the database, not in files. We can't compress articles with "bzip2 /database/revisions/*.txt". Any compression of multiple revisions in the same instance of bzip2 has to be managed by MediaWiki.
And I had misunderstood what you meant. =]
Perhaps misremembering an old thread on the subject, I thought you were discussing how to cut down on the size of offline database dumps, rather than the database used in the running wiki.
-Mark