On Sat, 15 Sep 2007, Erik Zachte wrote:
> People keep asking me about this, so let me elaborate on it here,
> rather than on wikitech, where it has been brought up a few times:
Thank you.
> But it has to be said: the current sad state, in which many dumps,
> large and small, have failed, is no longer the exception. See
> http://www.infodisiac.com/cgi-bin/WikimediaDownload.pl
>
> So I am waiting for good input. Notice that even if all goes well, the
> English dump job alone already runs for over 6 weeks! See
> http://download.wikimedia.org/enwiki/20070908/
>
> The current step started 2007-09-12, with an expected time of arrival
> of 2007-10-30. There is a good chance some mishap will occur before then.
Can someone elaborate on what is going on here? What are the steps
involved, and why do they take so long? It would take less time to copy
a terabyte of data to a spare disk, drive it to a world-class computing
cluster anywhere in the country, and have the dumps produced there
(including having people work out another implementation of the dump
process). Perhaps said computing cluster could also become the de facto
mirror-and-statistics center for Wikipedia data, to which researchers
would send complex queries to be run.
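
For a sense of scale, here is a back-of-envelope sketch in Python. The
1 TB dataset size and the ~50 MB/s sustained disk speed are assumptions
for illustration, not measured figures; only the dates come from the
status page above:

  # Rough comparison: copying ~1 TB to a spare disk vs. the scheduled
  # 48-day enwiki dump run (2007-09-12 to 2007-10-30, per the status
  # page above). All hardware figures are assumed, not measured.

  DATASET_BYTES = 1e12      # assumed ~1 TB of dump data
  DISK_BYTES_PER_S = 50e6   # assumed ~50 MB/s sustained copy speed

  copy_hours = DATASET_BYTES / DISK_BYTES_PER_S / 3600
  print("local disk copy: ~%.1f hours" % copy_hours)      # ~5.6 hours

  dump_seconds = 48 * 86400.0   # the scheduled 48-day dump step
  effective_kbps = DATASET_BYTES / dump_seconds / 1000
  print("dump run: ~%.0f kB/s effective" % effective_kbps)  # ~241 kB/s

Even allowing a day or two to pack and drive the disk somewhere, the
copy-and-compute-elsewhere route would finish orders of magnitude sooner
than a process whose effective throughput is a few hundred kB/s.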
SJ