Hi, everyone.
tl;dr: A new tool compresses full-history XML at ~100 MB/s rather than ~4
MB/s, with roughly the same average compression ratio as 7zip. Can anyone
help me test it more, or help experimentally deploy it?
As I understand it, compressing full-history dumps for English Wikipedia and
other big wikis takes a lot of resources: enwiki history is about 10TB
unpacked, and 7zip only packs a few MB/s/core. Even with 32 cores, that's
over a day of server time. There's been talk about ways to speed that up in
the past.[1]
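For the curious, the back-of-envelope arithmetic behind "over a day" looks something like this (taking "a few MB/s" to mean 3, which is my assumption, not a measured number):

```go
package main

import "fmt"

func main() {
	const dumpMB = 10.0 * 1e6 // ~10 TB of unpacked enwiki history, in MB
	const mbPerSecPerCore = 3.0 // rough guess at 7zip's "a few MB/s/core"
	const cores = 32

	// Total wall-clock hours if all cores compress at full speed.
	hours := dumpMB / (mbPerSecPerCore * cores) / 3600
	fmt.Printf("~%.0f hours\n", hours) // ≈ 29 hours, i.e. over a day
}
```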
It turns out that for history dumps in particular, you can compress many
times faster if you do a first pass that just trims the long chunks of text
that didn't change between revisions. A program called rzip[2] does this
(and rzip is _very_ cool, but, fatally for us, it can't stream its input or
output). The general approach is sometimes called Bentley-McIlroy
compression.[3]
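To illustrate the idea (this is not histzip's actual format or code, just a minimal sketch with made-up names and a toy block size): index fixed-size chunks of previously-seen text in a hash table, then replace long stretches of new text that match old text with (offset, length) references, leaving the rest as literals. Real implementations use a rolling hash so the scan is fast; this sketch just rehashes at every position.

```go
package main

import "fmt"

// Op is either a literal run of new bytes or a copy from the old text.
type Op struct {
	Literal []byte // non-nil for literal data
	Off, N  int    // copy source offset/length in old when Literal is nil
}

const block = 16 // fingerprint granularity; a real tool uses bigger blocks

// delta finds long stretches of cur that already appear in old, using a
// table of fixed-size block fingerprints (the Bentley-McIlroy idea,
// simplified: matches are only found at block-aligned source offsets).
func delta(old, cur []byte) []Op {
	// Index every block-aligned chunk of old.
	table := map[string]int{}
	for i := 0; i+block <= len(old); i += block {
		table[string(old[i:i+block])] = i
	}
	var ops []Op
	lit := 0 // start of the pending literal run
	for i := 0; i+block <= len(cur); {
		src, ok := table[string(cur[i:i+block])]
		if !ok {
			i++ // a rolling hash would make this scan cheap
			continue
		}
		// Extend the match forward past the block boundary.
		n := block
		for src+n < len(old) && i+n < len(cur) && old[src+n] == cur[i+n] {
			n++
		}
		if lit < i {
			ops = append(ops, Op{Literal: cur[lit:i]})
		}
		ops = append(ops, Op{Off: src, N: n})
		i += n
		lit = i
	}
	if lit < len(cur) {
		ops = append(ops, Op{Literal: cur[lit:]})
	}
	return ops
}

// apply reconstructs cur from old plus the ops, for a round-trip check.
func apply(old []byte, ops []Op) []byte {
	var out []byte
	for _, op := range ops {
		if op.Literal != nil {
			out = append(out, op.Literal...)
		} else {
			out = append(out, old[op.Off:op.Off+op.N]...)
		}
	}
	return out
}

func main() {
	old := []byte("<revision>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</revision>")
	cur := []byte("<revision>Lorem ipsum dolor sit amet, EDITED consectetur adipiscing elit.</revision>")
	ops := delta(old, cur)
	fmt.Println(string(apply(old, ops)) == string(cur)) // prints "true"
}
```

Revision history is the best case for this: consecutive revisions of a page are nearly identical, so almost everything collapses into a few copy references, and the small literal residue is what gets handed to the downstream compressor.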
So I wrote something I'm calling histzip.[4] It compresses long repeated
sections using a history buffer of a few MB. If you pipe history XML
through histzip to bzip2, the whole process can go ~100 MB/s/core, so we're
talking an hour or three to pack enwiki on a big box. While it compresses,
it also self-tests by unpacking its output and comparing checksums against
the original. I've done a couple of test runs on last month's full-history
dumps without checksum errors or crashes. On my last full run, the whole
dump came out about 1% smaller than 7zip's output; exact ratios varied from
file to file (it seems to do relatively better on pages with many
revisions) but stayed within +/-10% of 7zip's overall.
Also, less excitingly: histzip is a reasonably cheap way to make the daily
incremental dumps about 30% smaller.
Technical data dump aside: *How could I get this more thoroughly tested,
then maybe added to the dump process, perhaps with an eye to eventually
replacing 7zip as the alternate, non-bzip2 compressor?* Who do I talk to
to get started? (I'd dealt with Ariel Glenn before, but haven't seen
activity from Ariel lately, and in any case maybe playing with a new tool
falls under Labs or some heading other than dumps devops.) Am I nuts to be
even asking about this? Are there things that would definitely need to
change for integration to be possible? Basically, I'm trying to get this
from a tech demo to something with real-world utility.
Best,
Randall
[1] Some past discussion/experiments are captured at
http://www.mediawiki.org/wiki/Dbzip2, and some old scripts I wrote are at
https://git.wikimedia.org/commit/operations%2Fdumps/11e9b23b4bc76bf3d89e1fb…
[2] http://rzip.samba.org/
[3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8470&rep=re…
[4] https://github.com/twotwotwo/histzip