Hello folks,
Today I shot the full history en wiki dumps that were claiming they would take three months to complete. I started a new en wiki run which tests out two new features: running jobs step by step (in arbitrary order, assuming any dependencies have already been run) and breaking the XML file creation up into chunks that run in parallel.
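For anyone curious what "step by step" means in practice, the idea is simply that a job refuses to run until the jobs it depends on have finished; here's a rough sketch of the idea (the job names and dependency table are made up for illustration, this is not the actual dump code):

    # Rough sketch only: jobs may be requested in any order, as long as
    # each job's prerequisites have already completed. Job names are made up.
    DEPENDENCIES = {
        "stubs": [],
        "articles": ["stubs"],
        "meta-history": ["stubs"],
    }
    completed = set()

    def run_job(name):
        missing = [dep for dep in DEPENDENCIES[name] if dep not in completed]
        if missing:
            raise RuntimeError("%s is still waiting on: %s" % (name, missing))
        print("running %s ..." % name)   # the real job would write dump files here
        completed.add(name)

    run_job("stubs")          # fine: no prerequisites
    run_job("meta-history")   # fine: stubs already done
    # run_job("articles") could just as well have gone first after stubs.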
Because of these changes, you may notice some funkiness with the status pages over the next little while; things like the progress line are going to be out of whack, and I am sure we will find new and exciting bugs (though hopefully not in the dump file output).
We also have some bizarre behavior around the index pages, each of which seems to claim that the given date is its own previous dump. I'll be looking into it, but in the meantime the old dumps are available at:
http://dumps.wikimedia.org/enwiki/20100817/
http://dumps.wikimedia.org/enwiki/20100730
http://dumps.wikimedia.org/enwiki/20100130
http://dumps.wikimedia.org/enwiki/20100116
http://dumps.wikimedia.org/enwiki/20091103
The parallelizing scheme just does dumps of n pages in sequence (where n is arbitrarily set at 2 million right now), including all revisions or not, depending on the dump. This shouldn't screw with anyone's code that relies on the page IDs being in order.
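If it helps to picture it, the chunking amounts to slicing the page ID range into consecutive blocks of n pages, something like this (just a sketch; the numbers and names here are hypothetical, not what the dump scripts actually do internally):

    # Sketch: split the page ID space into consecutive chunks of n pages.
    # Because the chunks are consecutive ranges, page IDs stay in order
    # when the chunk files are read back in sequence.
    CHUNK_SIZE = 2000000         # "n", arbitrarily 2 million right now
    max_page_id = 30000000       # hypothetical; really taken from the database

    for start in range(1, max_page_id + 1, CHUNK_SIZE):
        end = min(start + CHUNK_SIZE - 1, max_page_id)
        # each (start, end) range becomes one XML chunk, dumped in parallel
        print("chunk: pages %d through %d" % (start, end))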
This is only being tested out on the en wiki dumps at present; all other jobs will run just as they used to.
Ah, I wonder if anyone out there would be interested in working on dbzip2, or in seeing whether it is still needed; it's a parallelizing bzip2 with some features that pbzip2 doesn't have (see http://www.mediawiki.org/wiki/Dbzip2). This could potentially save us time in the recombine phase of the bzip2 history dumps, if people want to have those available as one file and not just as separate pieces. We don't even have a start on that for 7zip, so that's another thing for someone to look into... any takers?
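One handy property for anyone poking at the recombine step, by the way: bzip2 streams can simply be concatenated and the bzip2 tool will still decompress the result as one continuous stream, so at the compression level the recombine is conceptually no more than this (a sketch with made-up file names; presumably the real recombine also has to deal with the XML wrapper of each piece):

    import shutil

    # Sketch: concatenate per-chunk .bz2 files into one multi-stream bzip2
    # file; the bzip2 command line tool decompresses such a file as one
    # continuous stream. File names here are made up.
    chunk_files = ["history-chunk1.xml.bz2", "history-chunk2.xml.bz2"]

    with open("history-combined.xml.bz2", "wb") as combined:
        for name in chunk_files:
            with open(name, "rb") as chunk:
                shutil.copyfileobj(chunk, combined)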
Ariel Glenn
wikitech-l@lists.wikimedia.org