I'm checking in some code to compress data in the old table. The primary motivation is to decrease the amount of disk and cache space needed to store page revision data, without the complications and fragility of differential compression.[1]
Compression is done with gzdeflate() / gzinflate(), which requires zlib support compiled into PHP. This is the same compression that would be used in a gzip file, but without the header bytes.
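For illustration, the round trip looks something like this (just a sketch, not part of the patch):

  <?php
  # Illustrative round trip only; requires PHP built with zlib.
  # gzdeflate() emits a raw DEFLATE stream (no gzip header/trailer),
  # and gzinflate() reverses it.
  $text = "'''Lorem ipsum''' dolor sit amet, consectetur adipiscing elit.";
  $gz   = gzdeflate( $text );      # default compression level
  $back = gzinflate( $gz );
  var_dump( $back === $text );     # bool(true)
  ?>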
Compressed revisions are marked with old_flags="gzip". The old_flags column has existed unused for quite some time, so no schema change is necessary. The compressed data goes back into old_text; I don't think there is a problem with storing binary data in a TEXT field, as supposedly TEXT and BLOB differ only in matching and sorting characteristics.
Article::getRevisionText() accepts a row object (as from wfFetchObject) containing both old_text and old_flags fields and returns the text, decompressing it if necessary.
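In rough outline it does something like the following (a simplified sketch; the committed code may differ in detail):

  function getRevisionText( $row ) {
      # Simplified sketch of Article::getRevisionText(); the committed
      # version may handle more flag values and edge cases.
      $text = $row->old_text;
      if ( isset( $row->old_flags ) &&
           in_array( 'gzip', explode( ',', $row->old_flags ) ) ) {
          $text = gzinflate( $text );
      }
      return $text;
  }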
This scheme should also work for the archive table, but there are probably issues with undeletion that still need to be checked.
So far there's no on-the-fly compression; a maintenance script, compressOld.php, is provided to batch-compress old revisions. It can be given an arbitrary starting old_id and will run until it reaches the end of the table or you kill it. It should be safe to run in the background while the wiki is live, since it only makes single-row UPDATEs keyed by old_id. On an otherwise unloaded 2 GHz Athlon XP it processes about 10,000 rows per minute.
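The core of the script amounts to something like this (simplified; the real compressOld.php uses the wiki's own database wrappers and is a bit more careful):

  $chunkSize = 100;
  $start = intval( $argv[1] );   # arbitrary starting old_id
  do {
      $res = mysql_query( "SELECT old_id, old_text, old_flags FROM old " .
          "WHERE old_id >= $start ORDER BY old_id LIMIT $chunkSize" );
      $n = 0;
      while ( $row = mysql_fetch_object( $res ) ) {
          $n++;
          $start = $row->old_id + 1;
          if ( $row->old_flags != '' ) {
              continue;   # already compressed or otherwise flagged
          }
          $gz = mysql_escape_string( gzdeflate( $row->old_text ) );
          # single-row UPDATE keyed by old_id; safe on a live wiki
          mysql_query( "UPDATE old SET old_text = '$gz', " .
              "old_flags = 'gzip' WHERE old_id = {$row->old_id}" );
      }
      mysql_free_result( $res );
  } while ( $n == $chunkSize );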
I haven't done any comparative testing of load times, but the effect should be dwarfed by parse/render times, and it will only come up on old and diff views and in a few other rare places.
I tested with the New Year's dump of the French Wikipedia (about 200k rows in old).
Raw dump size:
  old_table.sql        1,210,368,249
  old_compressed.sql     485,536,046
Space saved: ~60%
If these ratios hold, I estimate the total savings at about 14 gigabytes, bringing our total db usage down to something more like 20 GB. That's a reasonably big improvement for a very small change in code. (Note that InnoDB data storage space never shrinks; to reclaim the disk space for purposes other than storing the next couple million edits, we'd have to dump everything and reimport it fresh.)
There are a couple of downsides. The SQL dumps become less legible, and old revisions won't be loadable on a MediaWiki installation whose PHP was built without zlib (the default configure options don't include it); a possible sanity check is sketched below. Also, recompressing the resulting dump doesn't work out so well:
  old_table.sql.bz2        199,394,376
  old_compressed.sql.bz2   416,208,437
This doubles the size of the bzipped dumps. Ouch! Well, we should be looking at a more usable dump format anyway.
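As an aside on the zlib issue mentioned above: an installation could at least detect the problem up front with a check along these lines (hypothetical, not part of this patch):

  if ( !function_exists( 'gzinflate' ) ) {
      # Hypothetical guard: gzip-flagged revisions can't be read
      # without zlib support compiled into PHP (--with-zlib).
      die( "This wiki has gzip-compressed old revisions, but PHP was " .
           "built without zlib support.\n" );
  }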
-- brion vibber (brion @ pobox.com)
[1] Ultimately we'd probably save a lot of disk space by storing diffs between revisions, but loading an individual revision would then require sifting through multiple revisions since the last checkpoint, plus extra work to ensure that intermediate revisions are not corrupted, reordered, removed, etc. By compressing each revision separately, we still maintain the integrity of the rest of the history if any one revision is corrupted, if histories are reordered or recombined, if individual revisions are plucked out or blanked for legal reasons, etc.