I'm checking in some code to deal with compressing data in the old
table. The primary motivation here is to decrease the amount of disk
and cache space necessary for storing page revision data, without the
complications and fragility of differential compression.[1]
Compression is done with gzdeflate() / gzinflate(), which requires zlib
support compiled into PHP. This is the same compression that would be
used in a gzip file, but without the header bytes.
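To illustrate the format (not the actual PHP code): Python's zlib emits the same raw DEFLATE stream when given a negative window size, which makes for an easy round-trip check:

```python
import zlib

def deflate(text: bytes) -> bytes:
    # Raw DEFLATE stream: negative wbits omits the zlib/gzip header and
    # checksum, matching what PHP's gzdeflate() produces.
    co = zlib.compressobj(9, zlib.DEFLATED, -15)
    return co.compress(text) + co.flush()

def inflate(blob: bytes) -> bytes:
    # Counterpart of PHP's gzinflate(): raw DEFLATE, no header expected.
    return zlib.decompress(blob, -15)

wikitext = b"== Heading ==\nSome revision text, repeated. " * 50
blob = deflate(wikitext)
assert inflate(blob) == wikitext
assert len(blob) < len(wikitext)
```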
Compressed revisions are marked with old_flags="gzip". The old_flags
column has existed unused for quite some time, so no schema change is
necessary. The compressed data goes back into old_text; I don't think
there is a problem with storing binary data in a TEXT field, as
supposedly TEXT and BLOB differ only in matching and sorting
characteristics.
Article::getRevisionText() accepts a row object (as from wfFetchObject)
containing both old_text and old_flags fields and returns the text,
uncompressed if necessary.
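The dispatch amounts to a flag check; here is a minimal Python sketch of the same logic (Row and get_revision_text are illustrative stand-ins, not the actual PHP):

```python
import zlib

class Row:
    # Stand-in for the row object returned by wfFetchObject().
    def __init__(self, old_text: bytes, old_flags: str = ""):
        self.old_text = old_text
        self.old_flags = old_flags

def get_revision_text(row: Row) -> bytes:
    # old_flags is a comma-separated flag list; "gzip" marks
    # raw-DEFLATE data as written by gzdeflate().
    flags = {f.strip() for f in row.old_flags.split(",") if f.strip()}
    if "gzip" in flags:
        return zlib.decompress(row.old_text, -15)
    return row.old_text
```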
This scheme should also work in the archive table, but there are
probably problems with undeletion that need to be checked.

So far there's no on-the-fly compression; a maintenance script
compressOld.php is provided to batch-compress old revisions. It can be
given an arbitrary starting point old_id, and will go until it gets to
the end of the table or you kill it. It should be safe to run in the
background while the wiki is live; it makes single-row UPDATEs keyed by
old_id. On my otherwise unloaded 2 GHz Athlon XP it goes at about
10,000 rows per minute.
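The batch pass can be sketched as follows; an in-memory dict stands in for the old table, and compress_old is a hypothetical analogue of what compressOld.php does with its single-row UPDATEs:

```python
import zlib

def compress_old(rows, start_id=0):
    """Walk rows in old_id order from start_id, compressing any
    revision not yet flagged 'gzip'. Each row is updated on its own,
    mirroring the single-row UPDATEs keyed by old_id, so the pass is
    safe to interrupt and re-run. Returns the number of rows done."""
    done = 0
    for old_id in sorted(k for k in rows if k >= start_id):
        row = rows[old_id]
        if "gzip" in row["old_flags"]:
            continue  # already compressed; skip on re-runs
        co = zlib.compressobj(9, zlib.DEFLATED, -15)
        row["old_text"] = co.compress(row["old_text"]) + co.flush()
        row["old_flags"] = "gzip"
        done += 1
    return done
```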
I haven't done any comparative testing of load times, but the effect
should be dwarfed by parse/render times and will only come up on old
and diff views and a few other rare places.
I tested with the New Year's dump of the French Wikipedia (about 200k
rows in old).
Raw dump size:
old_table.sql 1,210,368,249
old_compressed.sql 485,536,046
Space saved: ~60%
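For the record, the arithmetic behind that figure:

```python
raw = 1_210_368_249         # old_table.sql
compressed = 485_536_046    # old_compressed.sql
saved = 1 - compressed / raw
print(f"{saved:.1%}")  # 59.9%, i.e. roughly 60%
```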
If these ratios hold, I estimate the total savings at about 14
gigabytes, bringing our total db usage to something more like 20 GB.
This is a reasonably big improvement for very small changes in code.
(Note that InnoDB data storage space never shrinks; to reclaim disk
space for purposes other than storing the next couple million edits
would require dumping everything and reimporting it fresh.)
There are a couple of downsides. The SQL dumps become slightly less
legible, and old revisions won't be loadable on a MediaWiki
installation whose PHP was built without zlib (the default configure
options don't include it). Also, the resultant dump doesn't recompress
well:
old_table.sql.bz2 199,394,376
old_compressed.sql.bz2 416,208,437
That roughly doubles the size of the compressed dumps. Ouch! Well, we
should be looking at a more usable dump format anyway.
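The reason is that compressing each revision separately hides the redundancy *between* revisions from bzip2. A small Python demonstration (the page history is faked with a hash per revision; the sizes are illustrative only):

```python
import bz2, hashlib, zlib

def deflate(data: bytes) -> bytes:
    # Raw DEFLATE, as gzdeflate() would store per row.
    co = zlib.compressobj(9, zlib.DEFLATED, -15)
    return co.compress(data) + co.flush()

# 200 nearly identical revisions of one page, each with a small
# unique edit (here faked with a hash), standing in for a history.
base = b"The quick brown fox jumps over the lazy dog. " * 12
rows = [base + hashlib.md5(b"%d" % i).hexdigest().encode()
        for i in range(200)]

# bzip2 over the plain dump exploits redundancy across revisions...
plain_bz2 = bz2.compress(b"".join(rows))

# ...but with each row deflated separately, that cross-revision
# redundancy is hidden and bzip2 has little left to remove.
compressed_bz2 = bz2.compress(b"".join(deflate(r) for r in rows))

assert len(compressed_bz2) > len(plain_bz2)
```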
-- brion vibber (brion @ pobox.com)
[1] Ultimately we'd probably save a lot of disk space by storing diffs
between revisions, but loading an individual revision then requires
sifting through multiple revisions from the last checkpoint, and
requires extra work to ensure that intermediate revisions are not
corrupted, reordered, removed, etc. By compressing each revision
separately, we still maintain the integrity of the rest of the history
if any one revision is corrupted, if histories are reordered or
recombined, if individual revisions are plucked out or blanked for
legal reasons, etc.