Hi Zak,
2012/9/2 Zak Wilcox <iwilcox(a)iwilcox.me.uk>:
> I know there's more to the choice of compression format than the size of
> the resulting dumps (e.g. time, memory, portability, existing code
> investment) and I read that you looked at LZMA and found it to be of
> insignificant benefit [1], but I noticed over at the Large Text
> Compression Benchmark site that they use 7-zip in PPMd mode and did some
> trial recompressions.
Indeed, PPMd often gives better compression ratios than LZMA for plain text.
> The bzip dumps use 900k blocks and according to the bzip2.org
> implementation's manual it takes around 7600k while compressing and
> around 3700k while decompressing. Like LZMA, PPMd apparently uses the
> same amount of memory for decompression as it used during compression,
PPMd is a symmetric algorithm, meaning that the code for compression
and decompression is pretty much the same and so are the memory and
time requirements (that's true for many of the PPM family of
algorithms). This is absolutely not the case for LZMA, which is
completely asymmetric: LZMA uses much less memory for decompression
than for compression and is much faster at decompressing than at
compressing (that's true for most of the LZ family of algorithms).
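For illustration, the asymmetry is easy to observe with xz, whose default filter is LZMA2 (sample data and preset here are arbitrary, not one of the dump runs):

```shell
# Generate some compressible sample text (arbitrary test data).
seq 1 500000 > sample.txt

# Compression: the slow, memory-hungry direction.
time xz --keep sample.txt

# Decompression: much faster, and needs little more than the dictionary.
time xz --decompress --stdout sample.txt.xz > roundtrip.txt

cmp sample.txt roundtrip.txt && echo "roundtrip OK"
```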
> so I recompressed the XML dump with various amounts of memory so you can
> make your own comparisons.
> Specifically, using 7zip 9.20 from Ubuntu Precise's p7zip-full, I ran:
>
> for MEM in 3700k 7600k 16m 512m; do
>     bzcat enwiki-20120802-pages-articles.xml.bz2 \
>         | 7z a -si -m0=PPMd:mem=$MEM \
>             enwiki-20120802-pages-articles.xml.$MEM.7z
> done
>
> bzcat enwiki-20120802-pages-articles.xml.bz2 \
>     | 7z a -si enwiki-20120802-pages-articles.xml.LZMA.7z
> for the following resulting file sizes in bytes (% of .bz2 version):
>
> original bz2: 9143865996
> $MEM=3700k  : 8648303296 (94.6%)
> $MEM=7600k  : 8043626528 (88.0%)
> $MEM=16m    : 7910637814 (86.5%) (the default for both PPMd & LZMA)
> LZMA        : 7705327210 (84.3%)
> $MEM=512m   : 7076755355 (77.4%)
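(For anyone checking, each percentage is just the ratio to the original bz2 size; e.g. for the 3700k run:)

```shell
# Ratio of the 3700k PPMd archive to the original bz2, sizes from above.
awk 'BEGIN { printf "%.1f%%\n", 100 * 8648303296 / 9143865996 }'   # prints 94.6%
```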
Those are interesting results. Have you tried other settings for LZMA
(different dictionary size, different match finder, different number of
fast bytes)?
> I wasn't looking to compare running times, and absolute values wouldn't
> compare to your servers, but for what it's worth I noticed that LZMA took
> over twice as long as any PPMd run. I was expecting PPMd to beat LZMA,
> hence the several PPMd runs.
Yeah, LZMA compression is pretty slow, but decompression is fast.
LZMA2 can help for compression speed, as it is designed with parallel
compression in mind, but it'll eat even more memory.
> There's probably some value in experimenting with PPMd's "model order"
> too, which I didn't try. Google "model order for PPMd" or see
> /usr/share/doc/p7zip-full/DOCS/MANUAL/switches/method.htm#PPMd (on
> Debian/Ubuntu).
The more you increase the model order, the more memory is needed, for
both compression and decompression.
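A quick way to see the effect of the order on archive size (a sketch; the orders, memory value and file names here are arbitrary, and `o` is p7zip's model-order switch, default 6):

```shell
# Sketch: compare archive sizes at a few PPMd model orders.
# Skips silently if p7zip is not installed.
if command -v 7z >/dev/null 2>&1; then
    seq 1 200000 > ppmd-test.txt
    for ORDER in 4 8 16; do
        7z a -m0=PPMd:o=$ORDER:mem=64m ppmd-o$ORDER.7z ppmd-test.txt >/dev/null
        ls -l ppmd-o$ORDER.7z
    done
fi
```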
> As the dump servers only have to do it once to save that bandwidth for
> every download from every mirror that month, perhaps it's worth giving
> 7zip more memory than bzip or even more than the default, although I
> appreciate that you drive some users out of the market if compression
> memory requirements equal decompression requirements and you start using
> a few gig to compress.
>
> Also, while you can (with a little effort) seek around bz2s and extract
> individual blocks, PPMd's seekability isn't something I've explored.
You can't seek in a PPMd stream because the model is trained on every
single bit. But that's true for LZMA too.
You can seek in XZ archives because the input is split into blocks (like
for bzip2), but I'm not sure whether that's a feature of LZMA2 or of the
XZ container.
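For the record, you can force multi-block output and inspect the layout with xz (the 1 MiB block size is arbitrary; note that `--block-size` only exists in xz releases newer than what shipped at the time of this thread):

```shell
seq 1 500000 > big.txt                  # arbitrary sample data

# Force independent 1 MiB (uncompressed) blocks inside the .xz stream.
xz --keep --block-size=1MiB big.txt

# Show the per-block layout of the container.
xz --list --verbose big.txt.xz
```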
Best regards,
--
Jérémie