[Foundation-l] Dump process needs serious fix / was Release of squid log data

Samuel Klein sj at laptop.org
Sat Sep 15 22:10:30 UTC 2007


On Sat, 15 Sep 2007, Erik Zachte wrote:

> People keep asking me about this, so let me elaborate on it here, rather
> than on wikitech, where it has been brought up a few times:

Thank you.

> But it has to be said, the current sad state, in which many dumps, large
> and small, have failed, is no longer the exception:
> see http://www.infodisiac.com/cgi-bin/WikimediaDownload.pl

> So I am waiting for good input. Notice that even if all goes well, the
> English dump job alone already takes over 6 weeks to run!
> See http://download.wikimedia.org/enwiki/20070908/
> Current step started 2007-09-12, expected time of arrival 2007-10-30.
> There is a good chance some mishap will occur before then.

Can someone elaborate on what is going on here?  What are the steps 
involved, and why does this take so long?  It would take less time to copy 
a terabyte of data to a spare disk, drive it to a world-class computing 
cluster anywhere in the country, and have the dumps generated there
(which would also let people work out another implementation of the dump
process). Maybe that computing cluster could also become the de facto
mirror-and-statistics center for Wikipedia data, where researchers could
send complex queries to be run.
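
To put very rough numbers on that claim, here is a minimal back-of-envelope
sketch in Python.  The 1 TB size, the ~60 MB/s disk throughput, the one-day
drive, and the 100 Mbit/s network link are illustrative assumptions of mine,
not figures from Erik's post:

    # Rough comparison of the options above; all throughput and travel
    # figures are illustrative assumptions, not numbers from this thread.
    TB = 1e12  # one terabyte, in bytes

    # Option 1: the current dump pipeline (2007-09-08 start, 2007-10-30 ETA).
    dump_days = 52

    # Option 2: "sneakernet" -- copy onto a spare disk and drive it over.
    disk_mb_per_s = 60            # assumed sustained disk throughput
    copy_hours = TB / (disk_mb_per_s * 1e6) / 3600
    drive_hours = 8               # assumed one-day drive to the cluster
    sneakernet_days = (2 * copy_hours + drive_hours) / 24  # write, drive, read back

    # Option 3: push it over the network instead.
    net_mbit_per_s = 100          # assumed sustained wide-area throughput
    network_days = TB * 8 / (net_mbit_per_s * 1e6) / 3600 / 24

    print("current dump run : ~%d days" % dump_days)
    print("disk + drive     : ~%.1f days" % sneakernet_days)
    print("100 Mbit/s copy  : ~%.1f days" % network_days)

Even with generous padding, moving the data is a matter of hours or days,
not weeks; the bottleneck seems to be generating the dump, not shipping it.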

SJ


