[Foundation-l] Dump process needs serious fix / was Release of squid log data

Erik Zachte erikzachte at infodisiac.com
Sat Sep 15 21:48:12 UTC 2007


Samuel:
> I'll also note that Erik Zachte's stats haven't been effectively run on
> the largest wikis for some time.

People keep asking me about this, so let me elaborate on it here, rather
than on wikitech, where it has been brought up a few times:

Dumps for the largest Wikipedias have an alarming failure rate.
To be sure: nobody in particular is to blame. The situation has gradually
deteriorated.

But it has to be said: the current sad state, in which many dumps, large and
small, have failed, is no longer an exception:
see http://www.infodisiac.com/cgi-bin/WikimediaDownload.pl

This is especially true for the English Wikipedia.
For most dumps there is at least an older valid copy; not so for the English
Wikipedia, and that has been the case for many months.

In a year's time there was >one< complete and valid English full-archive dump,
and then my script ran out of resources, processing over 1 terabyte of data,
on an overstretched server. No second chance: a few weeks later the dump had
vanished (AFAIK due to some overzealous disk housekeeping job). And apparently
there are still no good offline backups.

Since then all English dumps have failed, including a recent one that reported
success but in truth contains less than 5% of all data (that bug seems to be
fixed).
I heard that particular dump was used for WikiScan, which means we have many
interesting scandals still waiting to be discovered :)
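As an aside: even a crude sanity check along the lines of the sketch below
would flag such a truncated dump, whatever status the job itself reports.
This is only an illustration; the file name and expected page count are
made-up placeholders, not taken from the actual dump process.

# Crude dump completeness check, for illustration only.
# DUMP_FILE and EXPECTED_PAGES are hypothetical placeholders.
import bz2

DUMP_FILE = "enwiki-pages-meta-history.xml.bz2"   # hypothetical path
EXPECTED_PAGES = 2_000_000                        # hypothetical rough total

def count_pages(path):
    """Stream the bzip2-compressed XML dump and count opening <page> tags."""
    pages = 0
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            if "<page>" in line:
                pages += 1
    return pages

if __name__ == "__main__":
    found = count_pages(DUMP_FILE)
    ratio = found / EXPECTED_PAGES
    print("%d pages found (%.1f%% of expected)" % (found, ratio * 100))
    if ratio < 0.95:
        print("Dump looks truncated, even if the job reported success.")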

So I am waiting for good input. Note that even if all goes well, the
English dump job alone already runs for over 6 weeks!
See http://download.wikimedia.org/enwiki/20070908/
The current step started 2007-09-12, with an expected time of arrival of
2007-10-30. There is a good chance some mishap will occur before then.

More and more people expect regular and dependable dumps for research and
statistics.
But equally important, one of our core principles, the right to fork, rings
hollow without a complete and reasonably recent full-archive dump.
The board acknowledges that this is a serious issue; it is on the agenda for
the October meeting.

Either we need specific hardware thrown at it (I can't judge whether such a
solution would be feasible at all, but I have my doubts) or manpower to
redesign the dump process. We can't expect the current developer staff to
delve deeper into this on top of their myriad other chores. Brion gave it the
attention he could spare.

At least two presenters at Taipei mentioned this situation as a handicap:
Renaud Gaudin (Moulin) and Luca de Alfaro (On a Content-Driven Reputation
System for the Wikipedia).

Erik Zachte



