Hi dump producers,
I know there's more to the choice of compression format than the size of the resulting dumps (e.g. time, memory, portability, existing code investment), and I've read that you looked at LZMA and found it to be of insignificant benefit [1]. But I noticed over at the Large Text Compression Benchmark site that they use 7-zip in PPMd mode, so I did some trial recompressions.
The bzip2 dumps use 900k blocks, and according to the bzip2.org implementation's manual that takes around 7600k of memory while compressing and around 3700k while decompressing. Like LZMA, PPMd apparently uses the same amount of memory for decompression as it used during compression, so I recompressed the XML dump with various amounts of memory so you can make your own comparisons.
Specifically, using 7zip 9.20 from Ubuntu Precise's p7zip-full, I ran:
for MEM in 3700k 7600k 16m 512m; do
    bzcat enwiki-20120802-pages-articles.xml.bz2 \
        | 7z a -si -m0=PPMd:mem=$MEM \
            enwiki-20120802-pages-articles.xml.$MEM.7z
done
bzcat enwiki-20120802-pages-articles.xml.bz2 \
    | 7z a -si enwiki-20120802-pages-articles.xml.LZMA.7z
for the following resulting file sizes in bytes (% of .bz2 version):
original bz2: 9143865996
$MEM=3700k :  8648303296 (94.6%)
$MEM=7600k :  8043626528 (88.0%)
$MEM=16m :    7910637814 (86.5%)  (the default for both PPMd & LZMA)
LZMA:         7705327210 (84.3%)
$MEM=512m :   7076755355 (77.4%)
I wasn't looking to compare running times, and absolute values wouldn't carry over to your servers, but for what it's worth LZMA took over twice as long as any PPMd run. I was expecting PPMd to beat LZMA, hence the several PPMd runs.
There's probably some value in experimenting with PPMd's "model order" too, which I didn't try. Google "model order for PPMd" or see /usr/share/doc/p7zip-full/DOCS/MANUAL/switches/method.htm#PPMd (on Debian/Ubuntu).
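Untested, but I imagine a sweep over the order would look something like this (o= being 7-Zip's PPMd model-order switch, default 6, maximum 32):

    # Illustrative only: vary PPMd's model order at a fixed memory size.
    for ORDER in 4 6 8 16 32; do
        bzcat enwiki-20120802-pages-articles.xml.bz2 \
            | 7z a -si -m0=PPMd:o=$ORDER:mem=512m \
                enwiki-20120802-pages-articles.xml.o$ORDER.7z
    done

Higher orders cost more memory at both ends, so the mem= ceiling matters too.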
As the dump servers only have to compress once to save that bandwidth on every download from every mirror that month, perhaps it's worth giving 7zip more memory than bzip, or even more than the default. I appreciate, though, that you drive some users out of the market if compression memory requirements equal decompression requirements and you start using a few gig to compress. Also, while you can (with a little effort) seek around bz2s and extract individual blocks, PPMd's seekability isn't something I've explored.
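(For the bz2 side, bzip2recover from the stock bzip2 package demonstrates that the blocks really are independent. Untested sketch, and note bzip2recover historically caps the number of blocks it will handle, which a 9 GB dump may well exceed:)

    # Split a .bz2 at block boundaries, then decompress one block
    # on its own. bzip2recover writes one recNNNNN*.bz2 per block.
    bzip2recover enwiki-20120802-pages-articles.xml.bz2
    bzcat rec00042enwiki-20120802-pages-articles.xml.bz2 | head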
Just some thoughts.
Zak
[1]: "7-Zip's LZMA compression produces significantly smaller files for the full-history dumps, but doesn't do better than bzip2 for our other files." --- http://meta.wikimedia.org/wiki/Data_dumps
Hi Zak,
2012/9/2 Zak Wilcox <iwilcox@iwilcox.me.uk>:
> I know there's more to the choice of compression format than the size of
> the resulting dumps (e.g. time, memory, portability, existing code
> investment), and I've read that you looked at LZMA and found it to be of
> insignificant benefit [1]. But I noticed over at the Large Text
> Compression Benchmark site that they use 7-zip in PPMd mode, so I did
> some trial recompressions.
Indeed, PPMd often gives better compression ratios than LZMA for plain text.
> The bzip2 dumps use 900k blocks, and according to the bzip2.org
> implementation's manual that takes around 7600k of memory while
> compressing and around 3700k while decompressing. Like LZMA, PPMd
> apparently uses the same amount of memory for decompression as it used
> during compression,
PPMd is a symmetric algorithm, meaning that the code for compression and decompression is pretty much the same, and so are the memory and time requirements (that's true for many of the PPM family of algorithms). This is absolutely not the case for LZMA, which is completely asymmetric: LZMA uses much less memory for decompression than for compression and is much faster at decompressing than at compressing (that's true for most of the LZ family of algorithms).
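You can see this on any sample file; something like the following (untested; /usr/bin/time -v is GNU time's verbose mode, and "Maximum resident set size" is the number to watch):

    # Compare time and peak memory in both directions.
    # sample.xml is any reasonably large test file.
    /usr/bin/time -v 7z a -si -m0=LZMA sample-lzma.7z < sample.xml
    /usr/bin/time -v 7z e -so sample-lzma.7z > /dev/null
    /usr/bin/time -v 7z a -si -m0=PPMd sample-ppmd.7z < sample.xml
    /usr/bin/time -v 7z e -so sample-ppmd.7z > /dev/null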
> so I recompressed the XML dump with various amounts of memory so you can
> make your own comparisons.
> Specifically, using 7zip 9.20 from Ubuntu Precise's p7zip-full, I ran:
> for MEM in 3700k 7600k 16m 512m; do
>     bzcat enwiki-20120802-pages-articles.xml.bz2 \
>         | 7z a -si -m0=PPMd:mem=$MEM \
>             enwiki-20120802-pages-articles.xml.$MEM.7z
> done
> bzcat enwiki-20120802-pages-articles.xml.bz2 \
>     | 7z a -si enwiki-20120802-pages-articles.xml.LZMA.7z
> for the following resulting file sizes in bytes (% of .bz2 version):
> original bz2: 9143865996
> $MEM=3700k :  8648303296 (94.6%)
> $MEM=7600k :  8043626528 (88.0%)
> $MEM=16m :    7910637814 (86.5%)  (the default for both PPMd & LZMA)
> LZMA:         7705327210 (84.3%)
> $MEM=512m :   7076755355 (77.4%)
Those are interesting results. Have you tried other settings for LZMA (a different dictionary size, a different match finder, a different number of fast bytes)?
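For instance (untested, and the values are only examples, not recommendations):

    # Bigger dictionary (d=), maximum fast bytes (fb=, up to 273) and
    # the bt4 match finder (mf=) trade time and memory for ratio.
    bzcat enwiki-20120802-pages-articles.xml.bz2 \
        | 7z a -si -m0=LZMA:d=64m:fb=273:mf=bt4 \
            enwiki-20120802-pages-articles.xml.tuned.7z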
> I wasn't looking to compare running times, and absolute values wouldn't
> carry over to your servers, but for what it's worth LZMA took over twice
> as long as any PPMd run. I was expecting PPMd to beat LZMA, hence the
> several PPMd runs.
Yeah, LZMA compression is pretty slow, but decompression is fast. LZMA2 can help with compression speed, as it's designed with parallel compression in mind, but it'll eat even more memory.
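Something along these lines, assuming the XML has already been decompressed to disk (untested; -mmt is 7-Zip's thread-count switch):

    # LZMA2 with four threads; each extra thread costs roughly
    # another dictionary's worth of memory.
    7z a -m0=LZMA2 -mmt=4 enwiki-20120802-pages-articles.xml.lzma2.7z \
        enwiki-20120802-pages-articles.xml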
> There's probably some value in experimenting with PPMd's "model order"
> too, which I didn't try. Google "model order for PPMd" or see
> /usr/share/doc/p7zip-full/DOCS/MANUAL/switches/method.htm#PPMd (on
> Debian/Ubuntu).
The more you increase the model order, the more memory is needed, for both compression and decompression.
> As the dump servers only have to compress once to save that bandwidth on
> every download from every mirror that month, perhaps it's worth giving
> 7zip more memory than bzip, or even more than the default. I appreciate,
> though, that you drive some users out of the market if compression memory
> requirements equal decompression requirements and you start using a few
> gig to compress. Also, while you can (with a little effort) seek around
> bz2s and extract individual blocks, PPMd's seekability isn't something
> I've explored.
You can't seek in PPMd output because the model is trained on every single bit, and that's true for LZMA too. You can seek in XZ archives because the input is split into blocks (as with bzip2), though I'm not sure whether that's a feature of LZMA2 or of the XZ container.
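Newer xz releases expose the block splitting directly; something like this (untested, and --block-size may postdate the xz version you have):

    # Start a new compressed block every 16 MiB of input, so blocks
    # can later be decompressed (or indexed) independently.
    bzcat enwiki-20120802-pages-articles.xml.bz2 \
        | xz --block-size=16MiB \
        > enwiki-20120802-pages-articles.xml.xz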
Best regards,