Hey!
May I mention that the scripts generating the dumps and handling the scheduling are written in Python and available on Wikimedia SVN? [1]
If you have improvements to suggest for the task scheduling, I guess patches are welcome :)
In May, following another wikitech-l discussion [2], some small improvements were made to the dump processing, to prioritize the wikis that haven't been successfully dumped in a long time. Previously, failed dump attempts were not taken into account: the dumps were ordered only by "last dump try start time", which led to some inconsistencies.
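To give a rough idea of the change, here is an illustrative Python sketch; the function and field names are made up and are not the actual code in the backup scripts:

# Illustrative sketch only; 'last_attempt' / 'last_success' are made-up
# field names, not the actual status fields used by the dump scripts.
from datetime import datetime

def next_wiki_to_dump(wikis, status):
    """Pick the wiki whose last *successful* dump is oldest.

    status[wiki] is assumed to hold:
      - 'last_attempt': when the last dump run started (success or failure)
      - 'last_success': when the last dump actually completed
    """
    never = datetime.min
    # Old behaviour (roughly): order by status[w]['last_attempt'], so a wiki
    # whose runs kept failing still looked "recently dumped" and was pushed
    # to the back of the queue.
    # New behaviour: failed attempts no longer count as done.
    return min(wikis, key=lambda w: status[w].get('last_success', never))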
If I'm right, you should also keep in mind that the XML dumping process relies on the previous dumps to run faster: in other words, if you have a recent XML dump, the dumper can reuse text records from that existing dump instead of fetching them from external storage, which also requires decompressing and normalizing them. Here, the latest dump available for enwiki is from July, meaning a lot of new text has to be fetched from external storage: this first dump *will* take a long time, but you should expect the next ones to go faster.
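Very roughly, the prefetch idea looks like this; again just a sketch with made-up names, and the compression and normalization details are only illustrative, not the actual format used by external storage:

import zlib

def get_revision_text(rev_id, previous_dump, external_storage):
    # Fast path: reuse the text record from the existing (older) dump.
    if rev_id in previous_dump:
        return previous_dump[rev_id]
    # Slow path: fetch the blob from external storage, then decompress
    # and normalize it before it can be written out again; this is what
    # makes a dump without a recent predecessor so much slower.
    blob = external_storage[rev_id]
    text = zlib.decompress(blob).decode('utf-8')
    return text.replace('\r\n', '\n')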
[1] http://svn.wikimedia.org/viewvc/mediawiki/trunk/backup/
[2] http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/38401/...
2008/10/11 Anthony wikimail@inbox.org:
On Fri, Oct 10, 2008 at 7:49 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
I guess the answer, really, is to get more servers doing dumps - I'm sure that will come in time.
No, the answer, really, is to do the dumps more efficiently. Brion says this should come in the next couple months.
Anthony