On Sat, 15 Sep 2007, Erik Zachte wrote:
> People keep asking me about this, so let me elaborate on it here,
> rather than on wikitech, where it has been brought up a few times:
Thank you.
> But it has to be said: the current sad state, in which many dumps,
> large and small, have failed, is no longer the exception. See
> http://www.infodisiac.com/cgi-bin/WikimediaDownload.pl
>
> So I am waiting for good input. Notice that even if all goes well, the
> English dump job alone already runs for over 6 weeks! See
> http://download.wikimedia.org/enwiki/20070908/
>
> The current step started 2007-09-12, with an expected time of arrival
> of 2007-10-30. There is a good chance some mishap will occur before then.
Can someone elaborate on what is going on here? What are the steps
involved, and why do they take so long? It would take less time to copy
a terabyte of data to a spare disk, drive it to a world-class computing
cluster anywhere in the country, and have the dumps produced there
(including having people work out another implementation of the dump
process). Perhaps said computing cluster could also become the de facto
mirror-and-statistics center for Wikipedia data, to which researchers
would send complex queries to be run.
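
For a sense of scale, here is a back-of-envelope sketch in Python. The
1 TB dataset size and the ~50 MB/s sustained disk speed are assumptions
for illustration, not measured figures; only the dates come from the
status page above:

  # Rough comparison: copying ~1 TB to a spare disk vs. the scheduled
  # 48-day enwiki dump run (2007-09-12 to 2007-10-30, per the status
  # page above). All hardware figures are assumed, not measured.

  DATASET_BYTES = 1e12      # assumed ~1 TB of dump data
  DISK_BYTES_PER_S = 50e6   # assumed ~50 MB/s sustained copy speed

  copy_hours = DATASET_BYTES / DISK_BYTES_PER_S / 3600
  print("local disk copy: ~%.1f hours" % copy_hours)      # ~5.6 hours

  dump_seconds = 48 * 86400.0   # the scheduled 48-day dump step
  effective_kbps = DATASET_BYTES / dump_seconds / 1000
  print("dump run: ~%.0f kB/s effective" % effective_kbps)  # ~241 kB/s

Even allowing a day or two to pack and drive the disk somewhere, the
copy-and-compute-elsewhere route would finish orders of magnitude sooner
than a process whose effective throughput is a few hundred kB/s.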
SJ