Hi,
Some weeks ago, I read that WMF had downloaded every hourly log from
Domas' website. The Internet Archive only has the date range from December
2007 to September 2009. Domas' website now offers the last few
months (from April 2010 to now). So the data from October 2009 to March 2010
is missing.
Could a new section be enabled on download.wikimedia.org with a link to the
directory where WMF saves a copy of these logs?
Regards,
emijrp
Hello folks,
Today I killed the full-history en wiki dump run that claimed it would take
three months to complete. I started a new en wiki run which tests out
two new features: running jobs step by step (in arbitrary order,
assuming any dependencies have been run) and breaking up the xml file
creation into chunks that run in parallel.
Because of this, you may notice some funkiness with the status pages
over the next little while; things like the progress line are going to
be out of whack, and I am sure we will find new and exciting bugs
(though hopefully not in the dump file output).
We have some bizarre behavior around the index pages, each of which seems to
claim that the given date is its own previous dump. I'll be looking
into it, but in the meantime, the old dumps are available at:
http://dumps.wikimedia.org/enwiki/20100817/
http://dumps.wikimedia.org/enwiki/20100730
http://dumps.wikimedia.org/enwiki/20100130
http://dumps.wikimedia.org/enwiki/20100116
http://dumps.wikimedia.org/enwiki/20091103
The parallelizing scheme just does dumps of n pages in sequence (where n
is arbitrarily set at 2 million right now), including all revisions or
not, depending on the dump. This shouldn't break anyone's code
that depends on the page IDs being in order.
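
For anyone who wants a picture of the chunking, here is a rough sketch in Python. It is not the dump scripts' actual code: the function name and the choice to split by page ID range (rather than by a count of existing pages) are assumptions made for illustration. Only the chunk size of 2 million comes from the description above.

```python
def page_id_ranges(max_page_id, pages_per_chunk=2_000_000):
    """Yield (first_id, last_id) pairs covering 1..max_page_id in order.

    Hypothetical helper: assumes chunks are cut by page ID range.
    """
    start = 1
    while start <= max_page_id:
        # The last chunk may be shorter than pages_per_chunk.
        end = min(start + pages_per_chunk - 1, max_page_id)
        yield start, end
        start = end + 1


if __name__ == "__main__":
    # Each range could be handed to a separate worker; reading the
    # resulting pieces in this order keeps page IDs ascending.
    for first, last in page_id_ranges(5_000_000):
        print(first, last)
```

Since the ranges are contiguous and produced in ascending order, a consumer that reads the pieces in sequence still sees page IDs in order.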
This is only being tested out on the en wiki dumps at present; all other
jobs will run just as they used to.
Ah, I wonder if anyone out there would be interested in working on
dbzip2 or seeing if it is still needed; this is a parallelizing bzip2
with some features that pbzip2 doesn't have. See
http://www.mediawiki.org/wiki/Dbzip2
This could potentially save us time in the recombine phase of the bzip2
history dumps, if people want to have those available as one file and
not just as separate pieces.
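
As background on why recombining bzip2 pieces is plausible at all: separately compressed bzip2 streams can be concatenated byte for byte, and the result still decompresses as one file. The sketch below shows only that property using Python's standard bz2 module. It is not how the dump scripts recombine files, the sample data is made up, and it ignores the XML header and footer that each real piece would carry.

```python
import bz2


def recombine(pieces):
    """Join independently compressed bzip2 streams into one multi-stream file.

    Hypothetical helper for illustration; no recompression is performed.
    """
    return b"".join(pieces)


# Made-up stand-ins for two dump pieces, compressed independently.
parts = [
    bz2.compress(b"<page>1</page>\n"),
    bz2.compress(b"<page>2</page>\n"),
]

combined = recombine(parts)

# bz2.decompress reads through all concatenated streams in order.
restored = bz2.decompress(combined)
```

The catch for real dumps is the per-piece XML wrapper, which has to be stripped or avoided before a plain concatenation yields well-formed XML.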
We don't have even a start on that for 7zip, so that's another thing for
someone to look into... any takers?
Ariel Glenn