Pakaran suggested on IRC the use of 7zip's LZMA compression for data dumps, claiming really big improvements in compression over gzip. I did some test runs with the September 17 dump of es.wikipedia.org and can confirm it does make a big difference:
10,995,508,118  pages_full.xml       1.00x  uncompressed XML
 2,320,992,228  pages_full.xml.gz    4.74x  gzipped output from mwdumper
   775,765,248  pages_full.xml.bz2  14.17x  "bzip2"
   155,983,464  pages_full.xml.7z   70.49x  "7za a -si"
(gzip -9 makes a negligible difference versus the default compression level; bzip2 -9 seems to make no difference.)
The 7za program is a fair bit slower than gzip, but with 10-15 times better compression I suspect many people would find the download savings worth a little extra trouble.
While it's not any official or de facto standard that we know of, the code is open source (LGPL, CPL) and a basic command-line archiver is available for most Unix-like platforms as well as Windows, so it should be free to use (in the absence of surprise patents): http://www.7-zip.org/sdk.html
I'm probably going to try to work LZMA compression into the dump process to supplement the gzipped files; and/or we could switch from gzip back to bzip2, which provides a still respectable improvement in compression and is a bit more standard.
(We'd switched from bzip2 to gzip at some point in the SQL dump saga; I think this was when we had started using gzip internally on 'old' text entries, so the extra time spent on bzip2 was wasted trying to recompress the raw gzip data in the dumps.)
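To make that concrete, here's a rough sketch of the extra dump step I have in mind (a hypothetical Python wrapper, not actual dump code; it assumes the same "7za a -si" invocation as in the test above):

    import gzip
    import shutil
    import subprocess

    def compress_dump(xml_path="pages_full.xml"):
        # gzip copy, done in-process at maximum compression
        with open(xml_path, "rb") as src:
            with gzip.open(xml_path + ".gz", "wb", compresslevel=9) as dst:
                shutil.copyfileobj(src, dst)
        # LZMA copy via the 7za command-line archiver, fed on stdin
        with open(xml_path, "rb") as src:
            subprocess.run(["7za", "a", "-si", xml_path + ".7z"],
                           stdin=src, check=True)

Nothing upstream would have to change; the uncompressed XML just gets read once more per extra output format.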
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
While it's not any official or de facto standard that we know of, the code is open source (LGPL, CPL) and a basic command-line archiver is available for most Unix-like platforms as well as Windows, so it should be free to use (in the absence of surprise patents): http://www.7-zip.org/sdk.html
I've had good experiences with 7-Zip under Windows; I didn't know there was a *nix tool, which is what kept me from using it more often.
I'm probably going to try to work LZMA compression into the dump process to supplement the gzipped files; and/or we could switch from gzip back to bzip2, which provides a still respectable improvement in compression and is a bit more standard.
Which reminds me: why do people have to download the whole "XML'd" database every time they want to update? There should be a way to make smaller packages (like "pre-2003" or "2005-01"), each containing all revisions created in that timeframe. These could then be patched together and updated with just the latest package by those who need the old revisions at home :-)
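Just to sketch the idea (purely hypothetical, nothing that exists today), the bucketing could be as simple as mapping each revision's timestamp to a package name, so the old packages never change and only the newest one has to be re-downloaded:

    from datetime import datetime

    def package_for(rev_timestamp):
        # Everything before 2003 goes into one big "pre-2003" package;
        # after that, one package per month, e.g. "2005-01".
        if rev_timestamp.year < 2003:
            return "pre-2003"
        return rev_timestamp.strftime("%Y-%m")

    # package_for(datetime(2005, 1, 17)) -> "2005-01"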
Taking this idea further: the database seems to hold up fine right now, but with exponential growth come lots'o'revisions. At some point, we might want to add a "rev_on_disk" field to the revisions table and move the text of revisions older than, say, 3 months to the file system (file name generated from article and revision ID). That would save lots of space in the database, not interfere with important ongoing operations like revert wars :-) and still keep the "really old" versions accessible.
Not making much sense, am I? I need more coffee...
Magnus
Magnus Manske wrote:
At some point, we might want to add a "rev_on_disk" field to the revisions table and move the text of revisions older than, say, 3 months to the file system (file name generated from article and revision ID).
If and when you do this, hopefully you are going to use the page_id, not the article title, as otherwise you will have to rename a potentially arbitrary number of files upon a page move, and you may make it impossible to run on Windows due to its filename character restrictions.
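For illustration only (the layout and the sharding scheme below are made up for this example, not anything in MediaWiki), a page_id-based path could look like:

    import os

    def revision_path(base_dir, page_id, rev_id):
        # Shard by page_id so no single directory collects millions of
        # files; the title never appears in the path, so page moves and
        # odd characters in titles are a non-issue.
        shard = "%03d" % (page_id % 1000)
        return os.path.join(base_dir, shard, str(page_id), "%d.txt" % rev_id)

    # revision_path("/var/wiki/text", 12345, 678901)
    #   -> "/var/wiki/text/345/12345/678901.txt"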