Tomasz Finc wrote:
Commons finished just fine along with every single one of the other small & mid size wikis waiting to be picked up. Now we're just left with the big sized wikis to finish.
The new dump processes started on May 1 and were increased to twelve processes on May 4. As of yesterday, May 7, dumps have started on all databases. While the big ones (enwiki, dewiki, ...) are still running, tokiponawiktionary is the first to have its second dump in this round; the two were produced on May 1 and May 7. Soon, all small and medium-sized databases will have multiple dumps, at roughly 4-day intervals. This is a real improvement over the previous 12 months, and I really hope we don't fall behind again.
Now, to be even more useful, database dumps should be produced at *regular* intervals. That way, we can compare various measures such as article growth, link counts or usage of certain words, without having to factor the exact dump time into the count.
An easy way to implement this is to delay the next dump of a database to exactly one week after the previous dump started.
For example, the last dump of svwiki (Swedish Wikipedia) started at 20:48 (UTC) on Tuesday May 5. So let this time of week (20:48 on Tuesdays) be the timeslot for svwiki. If its turn comes up any earlier, the next dump should be delayed until 20:48 on May 12.
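The rule above can be sketched in a few lines of Python. This is only an illustration, not how the dump scheduler is actually written: the function name next_dump_start and its parameters are made up for this example, and the year (2009) is my assumption, since only the day and month are given above. The case where a database's turn comes up later than its timeslot is not covered by the proposal; here the dump simply starts right away.

```python
from datetime import datetime, timedelta, timezone

WEEK = timedelta(days=7)

def next_dump_start(previous_start, turn_comes_up):
    """Return the time at which the next dump of a database should start.

    previous_start: when the previous dump of this database started.
    turn_comes_up:  when a dump process becomes free to pick it up.
    Both names are hypothetical, chosen for this sketch.
    """
    timeslot = previous_start + WEEK
    # If the turn comes up early, wait for the timeslot.
    # If it comes up late, start immediately (an assumption of this sketch).
    return max(timeslot, turn_comes_up)

# svwiki: previous dump started Tuesday May 5 at 20:48 UTC (year assumed).
previous = datetime(2009, 5, 5, 20, 48, tzinfo=timezone.utc)
# Suppose its turn comes up again on Sunday May 10 at noon.
turn = datetime(2009, 5, 10, 12, 0, tzinfo=timezone.utc)
print(next_dump_start(previous, turn).strftime("%A %Y-%m-%d %H:%M"))
```

With these inputs the dump is held back until the Tuesday 20:48 timeslot on May 12, rather than starting on Sunday.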
That way, the number of mentions of "EU parliament" (elections are due on June 7) can be compared on a weekly (7-day) basis, rather than on a five-and-a-half-day basis. The 7-day interval removes any measurement bias from weekday/weekend variations.
Another advantage is that we can expect new dumps of svwiki by Wednesday lunch, and can plan our weekly projects accordingly.
This plan does not help the larger projects, which take many days to dump. They would still benefit from optimizations of the dump process itself. Right now the enwiki dump is extracting "page abstracts for Yahoo" and will continue to do so until May 21. I really hope Yahoo appreciates this, or else the current dump should be advanced to its next stage to save days and weeks. Maybe the pages-articles.xml part of the dump can be produced on a regular weekly (or fortnightly) basis even for the larger projects, while the other parts are produced less often.