Pakaran suggested on IRC using 7-Zip's LZMA compression for data
dumps, claiming really big improvements in compression over gzip. I did
some test runs with the September 17 dump of es.wikipedia.org and can
confirm it does make a big difference:
10,995,508,118  pages_full.xml       1.00x  uncompressed XML
 2,320,992,228  pages_full.xml.gz    4.74x  gzipped output from mwdumper
   775,765,248  pages_full.xml.bz2  14.17x  "bzip2"
   155,983,464  pages_full.xml.7z   70.49x  "7za a -si"
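(The ratios above are just the uncompressed size divided by each compressed size; a quick sketch to reproduce them from the figures in the table:

```python
# Sizes in bytes, copied from the dump listing above.
sizes = {
    "pages_full.xml":     10995508118,  # uncompressed XML
    "pages_full.xml.gz":   2320992228,  # gzipped output from mwdumper
    "pages_full.xml.bz2":   775765248,  # bzip2
    "pages_full.xml.7z":    155983464,  # 7za a -si
}

uncompressed = sizes["pages_full.xml"]
for name, size in sizes.items():
    # e.g. pages_full.xml.7z comes out at 70.49x
    print(f"{name:22} {uncompressed / size:6.2f}x")
```

so 7z is roughly 15x better than gzip and 5x better than bzip2 on this dump.)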
(gzip -9 makes a negligible difference versus the default compression
level; bzip2 -9 seems to make no difference.)
The 7za program is a fair bit slower than gzip, but at 10-15 times
better compression I suspect many people would find the download savings
worth a little extra trouble.
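For a rough feel of the size tradeoff without installing 7za: Python's standard library happens to ship all three algorithm families (zlib/DEFLATE as used by gzip, bz2, and lzma, the algorithm behind 7z, though a raw LZMA stream is not the same container as a .7z archive). A toy sketch on synthetic data, not the real dump, so the ratios will differ from the table above:

```python
import bz2
import lzma
import zlib

# Toy stand-in for dump XML: highly repetitive markup compresses very well.
data = b"<revision><text>sample wiki text</text></revision>\n" * 20000

gz = zlib.compress(data, 9)   # DEFLATE, as used by gzip
bz = bz2.compress(data, 9)
xz = lzma.compress(data)      # LZMA, the algorithm 7-Zip uses

for name, blob in [("deflate", gz), ("bzip2", bz), ("lzma", xz)]:
    print(f"{name:8} {len(data) / len(blob):8.1f}x")

# Sanity check: everything round-trips back to the original bytes.
assert zlib.decompress(gz) == data
assert bz2.decompress(bz) == data
assert lzma.decompress(xz) == data
```

Even on toy input like this, LZMA's larger dictionary and range coding generally put it well ahead of DEFLATE at the cost of extra CPU time.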
While it's not an official or de facto standard that we know of, the
code is open source (LGPL, CPL) and a basic command-line archiver is
available for most Unix-like platforms as well as Windows, so it should
be free to use (in the absence of surprise patents):
http://www.7-zip.org/sdk.html
I'm probably going to try to work LZMA compression into the dump process
to supplement the gzipped files; and/or we could switch from gzip back
to bzip2, which provides a still respectable improvement in compression
and is a bit more standard.
(We'd switched from bzip2 to gzip at some point in the SQL dump saga; I
think this was when we had started using gzip internally on 'old' text
entries and the extra time spent on bzip2 was wasted trying to
recompress the raw gzip data in the dumps.)
-- brion vibber (brion @ pobox.com)