Hi, everyone.
tl;dr: New tool compresses full-history XML at 100MB/s, not 4MB/s, with the same avg compression ratio as 7zip. Can anyone help me test more or experimentally deploy?
As I understand it, compressing full-history dumps for English Wikipedia and other big wikis takes a lot of resources: enwiki history is about 10TB unpacked, and 7zip only packs a few MB/s/core. Even with 32 cores, that's over a day of server time. There's been talk in the past about ways to speed that up.[1]
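As a back-of-envelope check on those numbers, here is a small sketch. It assumes decimal units (1 TB = 10^12 bytes), perfect scaling across cores, and no I/O bottleneck, none of which will hold exactly in practice; the function name is just for illustration.

```python
def wall_clock_hours(total_bytes, bytes_per_sec_per_core, cores):
    """Hours to push total_bytes through `cores` workers, assuming perfect scaling."""
    return total_bytes / (bytes_per_sec_per_core * cores) / 3600


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

# 3 and 4 MB/s stand in for "a few MB/s/core"; 100 MB/s is the tl;dr figure.
for rate in (3, 4, 100):
    hours = wall_clock_hours(10 * TB, rate * MB, cores=32)
    print(f"{rate:>3} MB/s/core on 32 cores: {hours:.1f} h")
```

Under those assumptions this works out to about 28.9 hours at 3 MB/s/core, 21.7 hours at 4 MB/s/core, and 0.9 hours at 100 MB/s/core.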
It turns out that for history dumps in particular, you can compress many times faster if you do a first pass that just trims the long chunks of text that didn't change between revisions. A program called rzip[2] does this (and rzip's _very_ cool, but fatally for us it can't stream input or output). The general approach is sometimes called Bentley-McIlroy compression.[3]
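To make the idea concrete, here is a toy sketch of long-repeat elimination in the Bentley-McIlroy style. It is not rzip's or histzip's actual code or output format: `pack`, `unpack`, the op tuples, and the 16-byte block size are made up for illustration. It also simplifies by looking up raw block contents in a dict instead of using a rolling hash, and by keeping unbounded history.

```python
def pack(data: bytes, block: int = 16):
    """Return a list of ops: ('lit', bytes) or ('copy', start, length)."""
    seen = {}        # block contents -> offset of its most recent aligned occurrence
    ops = []
    lit_start = 0    # start of the pending run of literal bytes
    next_index = 0   # next block-aligned offset not yet indexed
    i = 0
    n = len(data)
    while i + block <= n:
        # Index every aligned block that lies entirely before position i.
        while next_index + block <= i:
            seen[data[next_index:next_index + block]] = next_index
            next_index += block
        src = seen.get(data[i:i + block])
        if src is None:
            i += 1
            continue
        # Found an earlier copy of this block; extend the match forward.
        length = block
        while i + length < n and data[src + length] == data[i + length]:
            length += 1
        if lit_start < i:
            ops.append(('lit', data[lit_start:i]))
        ops.append(('copy', src, length))
        i += length
        lit_start = i
    if lit_start < n:
        ops.append(('lit', data[lit_start:]))
    return ops


def unpack(ops):
    """Rebuild the original bytes from the ops produced by pack()."""
    out = bytearray()
    for op in ops:
        if op[0] == 'lit':
            out += op[1]
        else:
            _, start, length = op
            # Byte-by-byte so a copy may overlap the bytes it is producing.
            for k in range(length):
                out.append(out[start + k])
    return bytes(out)


# Two "revisions" that differ by one word: most of the second becomes copy ops.
rev1 = b"<text>" + b"The quick brown fox jumps over the lazy dog. " * 20 + b"</text>"
rev2 = rev1.replace(b"lazy", b"sleepy", 1)
ops = pack(rev1 + rev2)
assert unpack(ops) == rev1 + rev2
```

Only repeats of at least `block` bytes are caught, which is the point: short-range redundancy is left for the general-purpose compressor that runs afterwards.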
So I wrote something I'm calling histzip.[4] It compresses long repeated sections using a history buffer of a few MB. If you pipe history XML through histzip to bzip2, the whole process can go ~100 MB/s/core, so we're talking an hour or three to pack enwiki on a big box. While it compresses, it also self-tests by unpacking its output and comparing checksums against the original. I've done a couple of test runs on last month's fullhist dumps without checksum errors or crashes. On the last full run I did, the whole dump compressed to about 1% smaller than 7zip's output; the exact ratios varied file to file (I think it's relatively better at pages with many revisions) but were generally within +/- 10% of 7zip's.
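The self-test idea can be sketched roughly like this. This is not histzip's code: Python's standard `bz2` module stands in for the compressor, CRC-32 stands in for whatever checksum is actually used, and `compress_with_self_test` is a hypothetical name.

```python
import bz2
import zlib


def compress_with_self_test(chunks):
    """Compress an iterable of byte chunks, checking the round trip as we go."""
    comp = bz2.BZ2Compressor()
    decomp = bz2.BZ2Decompressor()
    crc_in = 0    # running checksum of the original input
    crc_out = 0   # running checksum of what the decompressor gives back
    packed = []
    for chunk in chunks:
        crc_in = zlib.crc32(chunk, crc_in)
        out = comp.compress(chunk)
        if out:
            packed.append(out)
            crc_out = zlib.crc32(decomp.decompress(out), crc_out)
    tail = comp.flush()
    packed.append(tail)
    crc_out = zlib.crc32(decomp.decompress(tail), crc_out)
    if crc_in != crc_out:
        raise RuntimeError("self-test failed: round-trip checksum mismatch")
    return b"".join(packed)


packed = compress_with_self_test([b"<page>", b"<revision>text</revision>" * 100, b"</page>"])
```

The decompressor runs alongside the compressor, so the check needs no second pass over the input and works on a stream.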
Less exciting, but histzip's also a reasonably cheap way to get daily incr dumps about 30% smaller.
Technical datadump aside: *How could I get this more thoroughly tested, then maybe added to the dump process, perhaps with an eye to eventually replacing 7zip as the alternate, non-bzip2 compressor?* Who do I talk to to get started? (I'd dealt with Ariel Glenn before, but haven't seen activity from Ariel lately, and in any case maybe playing with a new tool falls under Labs or some other heading than dumps devops.) Am I nuts to even be asking about this? Are there things that would definitely need to change for integration to be possible? Basically, I'm trying to get this from a tech demo to something with real-world utility.
Best, Randall
[1] Some past discussion/experiments are captured at http://www.mediawiki.org/wiki/Dbzip2, and some old scripts I wrote are at https://git.wikimedia.org/commit/operations%2Fdumps/11e9b23b4bc76bf3d89e1fb3...
[2] http://rzip.samba.org/
[3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8470&rep=rep...
[4] https://github.com/twotwotwo/histzip