I'm doing a test run of the new data dump script on our Korean cluster; currently jawiki (ja.wikipedia.org) is in progress: http://amaryllis.yaseo.wikimedia.org/backup/jawiki/20060118/
Any comments on the page layout and information included in the progress page?
A couple of notes:
* File names have been changed to include the database name and dump date. This should make it easier to figure out what the hell you just downloaded.
* The directory structure is different: database names are used instead of the weird mix of sites, languages, and database names that made it hard to get the scripts to run reliably. Each database has a subdirectory for each day it was dumped, plus a 'latest' subdirectory with symbolic links to the files from the last completed dump. (There's a rough sketch of the layout just after this list.)
* I renamed 'pages_current' and 'pages_full' to 'pages-meta-current' and 'pages-meta-history'. In addition to the big explanatory labels, this should emphasize that these dumps contain metapages such as discussion and user pages, distancing them from the pages-articles dump.
* I've discontinued 7-Zip compression for the current-versions dumps, since it doesn't do better than bzip2 there. 7-Zip files are still generated for the history dump, where it compresses significantly better (about 3 GB vs 11 GB for enwiki). That choice is also sketched just after this list.
* Upload tarballs are still not included.
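In case the layout description above is hard to picture, here's a rough sketch of how the dated directories and the 'latest' links could fit together. This isn't the actual runner code; the helper and the 'latest' link names are made up for the example.

    import os

    def publish_dump_dir(base, db_name, date, file_suffixes):
        """Lay out <base>/<db>/<date>/ and refresh <base>/<db>/latest/ links.

        Simplified sketch of the naming and layout described above; the real
        runner does a lot more (status page, checksums, error handling).
        """
        dump_dir = os.path.join(base, db_name, date)
        latest_dir = os.path.join(base, db_name, 'latest')
        os.makedirs(dump_dir, exist_ok=True)
        os.makedirs(latest_dir, exist_ok=True)

        for suffix in file_suffixes:
            # File names carry the database name and date, e.g.
            # jawiki-20060118-pages-articles.xml.bz2
            name = '%s-%s-%s' % (db_name, date, suffix)
            link = os.path.join(latest_dir, '%s-latest-%s' % (db_name, suffix))
            if os.path.islink(link):
                os.remove(link)
            # Relative link into the dated directory of the last completed dump
            os.symlink(os.path.join('..', date, name), link)

    # e.g. publish_dump_dir('/backup', 'jawiki', '20060118',
    #                       ['pages-articles.xml.bz2', 'pages-meta-current.xml.bz2'])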
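And for the compression note, conceptually it boils down to picking compressors per dump type, something like the sketch below. The command strings are illustrative only, not the runner's real invocations.

    def compression_commands(dump_type, base_name):
        """Choose how a dump stream gets compressed on disk (sketch only).

        Current-revision dumps get bzip2 only, since 7-Zip doesn't beat
        bzip2 there; the full-history dump also gets a 7z archive, where
        long-range matching on near-duplicate revisions pays off
        (roughly 3 GB vs 11 GB for enwiki history).
        """
        commands = ['bzip2 > %s.xml.bz2' % base_name]
        if dump_type == 'pages-meta-history':
            # 7za's -si reads the XML stream from stdin
            commands.append('7za a -si %s.xml.7z' % base_name)
        return commands

    print(compression_commands('pages-meta-current',
                               'jawiki-20060118-pages-meta-current'))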
The backup runner script is written in Python and lives in our CVS in the 'backup' module, should anyone feel like laughing at my code.
A few more things need to be fixed up before I start running it on the main cluster, but it's pretty close! (A list of databases in progress, some locking, emailing me on error, and finding the prior XML dump to speed up dump generation; a rough sketch of that last bit follows.)
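The "prior dump" lookup is basically just a scan of the per-database directory for an older date; a hypothetical helper might look like this (the function and its checks are made up for illustration).

    import os
    import re

    def find_prior_dump(base, db_name, current_date):
        """Return the newest dump date older than the run in progress, or None.

        Hypothetical helper: scans <base>/<db>/ for YYYYMMDD directories so the
        runner can prefetch page text from the previous XML dump instead of
        pulling everything from the database again. Whether that earlier run
        actually completed would still need checking.
        """
        db_dir = os.path.join(base, db_name)
        if not os.path.isdir(db_dir):
            return None
        dates = [d for d in os.listdir(db_dir)
                 if re.match(r'\d{8}$', d) and d < current_date]
        return max(dates) if dates else None

    # find_prior_dump('/backup', 'jawiki', '20060118') -> e.g. '20060111' or None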
-- brion vibber (brion @ pobox.com)
Just a comment: why not add the user_groups table to the public dump? It could be useful for statistical purposes, and I can't see any risk in publishing this information.
Regards,
Emilio Gonzalez
Emilio Gonzalez wrote:
Just a comment: why not add the user_groups table to the public dump? It could be useful for statistical purposes, and I can't see any risk in publishing this information.
Ummm, because I forgot? :)
I'll switch it.
(The info it contains is already available through Special:Listusers, so it's not terribly secret.)
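For anyone wondering what "switching it" amounts to: the SQL table dumps are driven by a list of public tables, so it's essentially a one-line change. The snippet below is just an illustration with a made-up table list and helper, not the actual script.

    import subprocess

    # Illustrative subset of the publicly dumped tables; the real list is
    # longer. Adding user_groups just means adding it here.
    PUBLIC_TABLES = ['site_stats', 'image', 'imagelinks', 'categorylinks',
                     'user_groups']

    def dump_table(db_name, table, out_path):
        """Dump one table with mysqldump and gzip it (sketch only)."""
        with open(out_path, 'wb') as out:
            dump = subprocess.Popen(['mysqldump', db_name, table],
                                    stdout=subprocess.PIPE)
            subprocess.check_call(['gzip', '-c'], stdin=dump.stdout, stdout=out)
            dump.stdout.close()
            dump.wait()

    # e.g. for table in PUBLIC_TABLES:
    #          dump_table('jawiki', table, 'jawiki-20060118-%s.sql.gz' % table)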
-- brion vibber (brion @ pobox.com)