[Foundation-l] Dump process needs serious fix / was Release of squid log data

Florence Devouard Anthere9 at yahoo.com
Sat Sep 15 23:19:18 UTC 2007


Thank you for the detailed description of the situation, Erik.

Ant

Erik Zachte wrote:
> Samuel:
>> I'll also note that Erik Zachte's stats haven't been effectively run on
>> the largest wikis for some time.
> 
> People keep asking me about this, so let me elaborate on it here, rather
> than on wikitech, where it has been brought up a few times:
> 
> Dumps for the largest Wikipedias have an alarming failure rate.
> To be sure, nobody in particular is to blame; the situation deteriorated
> gradually.
> 
> But it has to be said: the current sad state, in which many dumps, large
> and small, have failed, is no longer the exception:
> see http://www.infodisiac.com/cgi-bin/WikimediaDownload.pl
> 
> This is especially true for the English Wikipedia.
> For most wikis there is at least an older valid copy; for the English
> Wikipedia there has been none for many months.
> 
> In the past year there was >one< complete and valid English full-archive
> dump, and then my script ran out of resources processing over 1 terabyte
> of data on an overstretched server. There was no second chance: a few
> weeks later the dump had vanished (AFAIK to some overzealous disk
> housekeeping job). And apparently there are still no good offline backups.
> 
> Since then all English dumps have failed, including a recent one that
> reported success but in truth contains less than 5% of all data (that
> bug seems to be fixed).
> I heard that particular dump was used for WikiScan, which means we have many
> interesting scandals still waiting to be discovered :)
> 
> So I am still waiting for good input. Note that even if all goes well, the
> English dump job alone takes over six weeks to complete!
> See http://download.wikimedia.org/enwiki/20070908/
> The current step started 2007-09-12, with an expected completion date of
> 2007-10-30. There is a good chance some mishap will occur before then.
> 
> More and more people expect regular and dependable dumps for research and
> statistics.
> But equally important, one of our core principles, the right to fork, is
> meaningless without a complete and reasonably recent full-archive dump.
> The board acknowledges that this is a serious issue; it is on the agenda
> for the October meeting.
> 
> Either we need specific hardware thrown at it (I can't judge whether such
> a solution is possible at all, but I have my doubts), or we need manpower
> to redesign the dump process. We can't expect the current developer staff
> to delve deeper into this on top of their myriad other chores. Brion gave
> it what attention he could spare.
> 
> At least two presenters at Taipei mentioned this situation as a handicap:
> Renaud Gaudin (Moulin) and Luca de Alfaro (On a Content-Driven Reputation
> System for the Wikipedia).
> 
> Erik Zachte



