Hi Zak,
2012/9/2 Zak Wilcox <iwilcox(a)iwilcox.me.uk>:
> I know there's more to the choice of compression format than the size of
> the resulting dumps (e.g. time, memory, portability, existing code
> investment) and I read that you looked at LZMA and found it to be of
> insignificant benefit [1], but I noticed over at the Large Text
> Compression Benchmark site that they use 7-zip in PPMd mode and did some
> trial recompressions.
Indeed, PPMd often gives better compression ratios than LZMA for plain text.
> The bzip dumps use 900k blocks and according to the bzip2.org
> implementation's manual it takes around 7600k while compressing and
> around 3700k while decompressing. Like LZMA, PPMd apparently uses the
> same amount of memory for decompression as it used during compression,
PPMd is a symmetric algorithm, meaning that the code for compression
and decompression is pretty much the same and so are the memory and
time requirements (that's true for many of the PPM family of
algorithms). This is absolutely not the case for LZMA, which is
completely asymmetric: LZMA uses much less memory for decompression
than for compression and is much faster at decompressing than at
compressing (that's true for most of the LZ family of algorithms).
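For illustration, the asymmetry is easy to observe with xz, whose default filter is LZMA2 (sample data and preset here are arbitrary, not one of the dump runs):

```shell
# Generate some compressible sample text (arbitrary test data).
seq 1 500000 > sample.txt

# Compression: the slow, memory-hungry direction.
time xz --keep sample.txt

# Decompression: much faster, and needs little more than the dictionary.
time xz --decompress --stdout sample.txt.xz > roundtrip.txt

cmp sample.txt roundtrip.txt && echo "roundtrip OK"
```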
> so I recompressed the XML dump with various amounts of memory so you can
> make your own comparisons.
> Specifically, using 7zip 9.20 from Ubuntu Precise's p7zip-full, I ran:
>
> for MEM in 3700k 7600k 16m 512m; do
>     bzcat enwiki-20120802-pages-articles.xml.bz2 \
>         | 7z a -si -m0=PPMd:mem=$MEM \
>             enwiki-20120802-pages-articles.xml.$MEM.7z
> done
>
> bzcat enwiki-20120802-pages-articles.xml.bz2 \
>     | 7z a -si enwiki-20120802-pages-articles.xml.LZMA.7z
> for the following resulting file sizes in bytes (% of .bz2 version):
>
> original bz2: 9143865996
> $MEM=3700k  : 8648303296 (94.6%)
> $MEM=7600k  : 8043626528 (88.0%)
> $MEM=16m    : 7910637814 (86.5%) (the default for both PPMd & LZMA)
> LZMA        : 7705327210 (84.3%)
> $MEM=512m   : 7076755355 (77.4%)
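(For anyone checking, each percentage is just the ratio to the original bz2 size; e.g. for the 3700k run:)

```shell
# Ratio of the 3700k PPMd archive to the original bz2, sizes from above.
awk 'BEGIN { printf "%.1f%%\n", 100 * 8648303296 / 9143865996 }'   # prints 94.6%
```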
Those are interesting results. Have you tried other settings for LZMA
(different dictionary size, different match finder, different number of
fast bytes)?
> I wasn't looking to compare running times, and absolute values wouldn't
> compare to your servers, but for what it's worth I noticed that LZMA took
> over twice as long as any PPMd run. I was expecting PPMd to beat LZMA,
> hence the several PPMd runs.
Yeah, LZMA compression is pretty slow, but decompression is fast.
LZMA2 can help for compression speed, as it is designed with parallel
compression in mind, but it'll eat even more memory.
> There's probably some value in experimenting with PPMd's "model order"
> too, which I didn't try. Google "model order for PPMd" or see
> /usr/share/doc/p7zip-full/DOCS/MANUAL/switches/method.htm#PPMd (on
> Debian/Ubuntu).
The more you increase the model order, the more memory is needed, for
both compression and decompression.
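A quick way to see the effect of the order on archive size (a sketch; the orders, memory value and file names here are arbitrary, and `o` is p7zip's model-order switch, default 6):

```shell
# Sketch: compare archive sizes at a few PPMd model orders.
# Skips silently if p7zip is not installed.
if command -v 7z >/dev/null 2>&1; then
    seq 1 200000 > ppmd-test.txt
    for ORDER in 4 8 16; do
        7z a -m0=PPMd:o=$ORDER:mem=64m ppmd-o$ORDER.7z ppmd-test.txt >/dev/null
        ls -l ppmd-o$ORDER.7z
    done
fi
```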
> As the dump servers only have to do it once to save that bandwidth for
> every download from every mirror that month, perhaps it's worth giving
> 7zip more memory than bzip or even more than the default, although I
> appreciate that you drive some users out of the market if compression
> memory requirements equal decompression requirements and you start using
> a few gig to compress.
>
> Also, while you can (with a little effort) seek around bz2s and extract
> individual blocks, PPMd's seekability isn't something I've explored.
You can't seek in a PPMd stream because the model is trained on every
single bit. But that's true for LZMA too.
You can seek in XZ archives because the input is split into blocks (like
for bzip2), but I'm not sure whether that's a feature of LZMA2 or of the
XZ container.
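For the record, you can force multi-block output and inspect the layout with xz (the 1 MiB block size is arbitrary; note that `--block-size` only exists in xz releases newer than what shipped at the time of this thread):

```shell
seq 1 500000 > big.txt                  # arbitrary sample data

# Force independent 1 MiB (uncompressed) blocks inside the .xz stream.
xz --keep --block-size=1MiB big.txt

# Show the per-block layout of the container.
xz --list --verbose big.txt.xz
```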
Best regards,
--
Jérémie