Hi dump producers,
I know there's more to the choice of compression format than the size of the resulting dumps (e.g. time, memory, portability, existing code investment), and I read that you looked at LZMA and found it worthwhile only for the full-history dumps [1]. But I noticed over at the Large Text Compression Benchmark site that they use 7-zip in PPMd mode, so I did some trial recompressions.
The bzip2 dumps use 900k blocks, and according to the manual for the bzip2.org implementation that takes around 7600k of memory while compressing and around 3700k while decompressing. Like LZMA, PPMd apparently needs the same amount of memory to decompress as was used to compress, so I recompressed the XML dump with various amounts of memory so you can make your own comparisons.
Specifically, using 7zip 9.20 from Ubuntu Precise's p7zip-full, I ran:
for MEM in 3700k 7600k 16m 512m; do
    bzcat enwiki-20120802-pages-articles.xml.bz2 \
        | 7z a -si -m0=PPMd:mem=$MEM \
          enwiki-20120802-pages-articles.xml.$MEM.7z
done
bzcat enwiki-20120802-pages-articles.xml.bz2 \
    | 7z a -si enwiki-20120802-pages-articles.xml.LZMA.7z
for the following resulting file sizes in bytes (% of .bz2 version):
original bz2 : 9143865996
$MEM=3700k   : 8648303296  (94.6%)
$MEM=7600k   : 8043626528  (88.0%)
$MEM=16m     : 7910637814  (86.5%)  (the default for both PPMd & LZMA)
LZMA         : 7705327210  (84.3%)
$MEM=512m    : 7076755355  (77.4%)
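If you want to double-check that a recompressed archive round-trips to identical content, comparing checksums of the original and extracted streams should do it. An untested sketch, using 7z's -so switch to extract to stdout:

# untested sketch: both checksums should come out identical
bzcat enwiki-20120802-pages-articles.xml.bz2 | md5sum
7z x -so enwiki-20120802-pages-articles.xml.512m.7z | md5sum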
I wasn't trying to compare running times, and absolute numbers from my machine wouldn't translate to your servers, but for what it's worth LZMA took over twice as long as any of the PPMd runs. I was expecting PPMd to beat LZMA, hence the several PPMd runs.
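If anyone does want proper timings on their own hardware, wrapping each run in time is the obvious starting point, e.g. (untested):

# untested: wall-clock time for a single PPMd run
time ( bzcat enwiki-20120802-pages-articles.xml.bz2 \
       | 7z a -si -m0=PPMd:mem=16m timing-test.7z )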
There's probably some value in experimenting with PPMd's "model order" too, which I didn't try. Google "model order for PPMd" or see /usr/share/doc/p7zip-full/DOCS/MANUAL/switches/method.htm#PPMd (on Debian/Ubuntu).
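For example, sweeping the order at a fixed memory size would look something like this (untested; o= is the PPMd order switch described on that manual page):

# untested: vary the PPMd model order at a fixed 512m model size
for ORDER in 4 8 16 32; do
    bzcat enwiki-20120802-pages-articles.xml.bz2 \
        | 7z a -si -m0=PPMd:o=$ORDER:mem=512m \
          enwiki-20120802-pages-articles.xml.o$ORDER.7z
done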
As the dump servers only have to compress once to save that bandwidth on every download from every mirror that month, it's perhaps worth giving 7zip more memory than bzip2 gets, or even more than 7zip's default. I appreciate, though, that you drive some users out of the market if compression memory requirements equal decompression requirements and you start using a few gig to compress. Also, while you can (with a little effort) seek around a .bz2 and extract individual blocks, PPMd's seekability isn't something I've explored.
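For anyone unfamiliar with that bz2 trick, bzip2recover (which ships with bzip2 and splits a .bz2 into one file per compressed block) is a crude way to see it on a small test file. An untested sketch:

# untested: each rec*.bz2 piece is a complete bz2 stream, one per 900k block,
# and each decompresses on its own
head -c 10000000 /dev/urandom | bzip2 > sample.bz2
bzip2recover sample.bz2          # writes rec*sample.bz2, one file per block
bzcat rec*sample.bz2 | wc -c     # 10000000 again, blocks back in order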
Just some thoughts.
Zak
[1]: "7-Zip's LZMA compression produces significantly smaller files for the full-history dumps, but doesn't do better than bzip2 for our other files." --- http://meta.wikimedia.org/wiki/Data_dumps