I'm checking in some code to deal with compressing data in the old
table. The primary motivation here is to decrease the amount of disk
and cache space necessary for storing page revision data, without the
complications and fragility of differential compression.[1]
Compression is done with gzdeflate() / gzinflate(), which requires zlib
support compiled into PHP. This is the same compression that would be
used in a gzip file, but without the header bytes.
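To illustrate the format (not the actual PHP code): Python's zlib emits the same raw DEFLATE stream when given a negative window size, which makes for an easy round-trip check:

```python
import zlib

def deflate(text: bytes) -> bytes:
    # Raw DEFLATE stream: negative wbits omits the zlib/gzip header and
    # checksum, matching what PHP's gzdeflate() produces.
    co = zlib.compressobj(9, zlib.DEFLATED, -15)
    return co.compress(text) + co.flush()

def inflate(blob: bytes) -> bytes:
    # Counterpart of PHP's gzinflate(): raw DEFLATE, no header expected.
    return zlib.decompress(blob, -15)

wikitext = b"== Heading ==\nSome revision text, repeated. " * 50
blob = deflate(wikitext)
assert inflate(blob) == wikitext
assert len(blob) < len(wikitext)
```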
Compressed revisions are marked with old_flags="gzip". The old_flags
column has existed unused for quite some time, so no schema change is
necessary. The compressed data goes back into old_text; I don't think
there is a problem with storing binary data in a TEXT field, as
supposedly TEXT and BLOB differ only in matching and sorting
characteristics.
Article::getRevisionText() accepts a row object (as from wfFetchObject)
containing both old_text and old_flags fields and returns the text,
uncompressed if necessary.
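The dispatch amounts to a flag check; here is a minimal Python sketch of the same logic (Row and get_revision_text are illustrative stand-ins, not the actual PHP):

```python
import zlib

class Row:
    # Stand-in for the row object returned by wfFetchObject().
    def __init__(self, old_text: bytes, old_flags: str = ""):
        self.old_text = old_text
        self.old_flags = old_flags

def get_revision_text(row: Row) -> bytes:
    # old_flags is a comma-separated flag list; "gzip" marks
    # raw-DEFLATE data as written by gzdeflate().
    flags = {f.strip() for f in row.old_flags.split(",") if f.strip()}
    if "gzip" in flags:
        return zlib.decompress(row.old_text, -15)
    return row.old_text
```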
This scheme should also work in the archive table, but there are
probably problems with undeletion that need to be checked.

So far there's no on-the-fly compression; a maintenance script
compressOld.php is provided to batch-compress old revisions. It can be
given an arbitrary starting point old_id, and will go until it gets to
the end of the table or you kill it. It should be safe to run in the
background while the wiki is live; it makes single-row UPDATEs keyed by
old_id. On my otherwise unloaded 2 GHz Athlon XP it goes at about
10,000 rows per minute.
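The batch pass can be sketched as follows; an in-memory dict stands in for the old table, and compress_old is a hypothetical analogue of what compressOld.php does with its single-row UPDATEs:

```python
import zlib

def compress_old(rows, start_id=0):
    """Walk rows in old_id order from start_id, compressing any
    revision not yet flagged 'gzip'. Each row is updated on its own,
    mirroring the single-row UPDATEs keyed by old_id, so the pass is
    safe to interrupt and re-run. Returns the number of rows done."""
    done = 0
    for old_id in sorted(k for k in rows if k >= start_id):
        row = rows[old_id]
        if "gzip" in row["old_flags"]:
            continue  # already compressed; skip on re-runs
        co = zlib.compressobj(9, zlib.DEFLATED, -15)
        row["old_text"] = co.compress(row["old_text"]) + co.flush()
        row["old_flags"] = "gzip"
        done += 1
    return done
```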
I haven't done any comparative testing of load times, but the effect
should be dwarfed by parse/render times and will only come up on old
and diff views and a few other rare places.
I tested with the New Year's dump of the French Wikipedia (about 200k
rows in old).
Raw dump size:
old_table.sql 1,210,368,249
old_compressed.sql 485,536,046
Space saved: ~60%
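For the record, the arithmetic behind that figure:

```python
raw = 1_210_368_249         # old_table.sql
compressed = 485_536_046    # old_compressed.sql
saved = 1 - compressed / raw
print(f"{saved:.1%}")  # 59.9%, i.e. roughly 60%
```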
If these ratios hold, I estimate the total savings at about 14
gigabytes, bringing our total db usage to something more like 20 GB.
This is a reasonably big improvement for very small changes in code.
(Note that InnoDB data storage space never shrinks; to reclaim disk
space for purposes other than storing the next couple million edits
would require dumping everything and reimporting it fresh.)
There are a couple of downsides. The SQL dumps become slightly less
legible, and old revisions won't be loadable on a MediaWiki
installation whose PHP was built without zlib (the default configure
options don't include it). Also, the resultant dump doesn't recompress
well:
old_table.sql.bz2 199,394,376
old_compressed.sql.bz2 416,208,437
That roughly doubles the size of the compressed dumps. Ouch! Well, we
should be looking at a more usable dump format anyway.
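The reason is that compressing each revision separately hides the redundancy *between* revisions from bzip2. A small Python demonstration (the page history is faked with a hash per revision; the sizes are illustrative only):

```python
import bz2, hashlib, zlib

def deflate(data: bytes) -> bytes:
    # Raw DEFLATE, as gzdeflate() would store per row.
    co = zlib.compressobj(9, zlib.DEFLATED, -15)
    return co.compress(data) + co.flush()

# 200 nearly identical revisions of one page, each with a small
# unique edit (here faked with a hash), standing in for a history.
base = b"The quick brown fox jumps over the lazy dog. " * 12
rows = [base + hashlib.md5(b"%d" % i).hexdigest().encode()
        for i in range(200)]

# bzip2 over the plain dump exploits redundancy across revisions...
plain_bz2 = bz2.compress(b"".join(rows))

# ...but with each row deflated separately, that cross-revision
# redundancy is hidden and bzip2 has little left to remove.
compressed_bz2 = bz2.compress(b"".join(deflate(r) for r in rows))

assert len(compressed_bz2) > len(plain_bz2)
```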
-- brion vibber (brion @ pobox.com)
[1] Ultimately we'd probably save a lot of disk space by storing diffs
between revisions, but loading an individual revision then requires
sifting through multiple revisions from the last checkpoint, and
requires extra work to ensure that intermediate revisions are not
corrupted, reordered, removed, etc. By compressing each revision
separately, we still maintain the integrity of the rest of the history
if any one revision is corrupted, if histories are reordered or
recombined, if individual revisions are plucked out or blanked for
legal reasons, etc.