Hello folks,
So this has been running for all of one day now, and I expect it to break in wild and crazy ways over the next while as we get the bugs out. But, throwing caution to the winds...
I'm generating dumps each day, for each non-closed non-private project, of the revisions added since the previous day. They use the standard XML format and are written out as stubs and history files.
This is a sort of poor person's incremental dump. What do I mean by that? Well... It doesn't contain a list of deletions, page moves, or undeletes. It just dumps the metadata and text for every revision between X1 (the last revision dumped the day before) and X2 (the last revision in the db as of the time it's dumped). The reason for that? Dumping a range of revisions is relatively easy. Accounting for page deletions, moves etc. since the previous dump is hard, so that is an exercise left to the reader :-P
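To make the range idea concrete, here is a rough Python sketch of the selection step. It is not the actual dump code: the Revision tuple, its field names, and the revisions_to_dump function are made up for illustration, and I'm assuming the range excludes X1 (already dumped the day before) and includes X2.

```python
from typing import Iterable, List, NamedTuple


class Revision(NamedTuple):
    """Hypothetical stand-in for a row of revision metadata plus text."""
    rev_id: int
    page_id: int
    text: str


def revisions_to_dump(revisions: Iterable[Revision],
                      x1: int, x2: int) -> List[Revision]:
    """Return revisions with x1 < rev_id <= x2, ordered by rev_id.

    x1 is the last revision id dumped the day before (assumed excluded),
    x2 is the last revision id eligible for today's run (assumed included).
    Deletions, moves and undeletes are not tracked at all.
    """
    selected = [r for r in revisions if x1 < r.rev_id <= x2]
    return sorted(selected, key=lambda r: r.rev_id)
```

Anyone consuming these files still has to work out deletions and moves some other way, since nothing in a plain revision range records them.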
Even with these limitations I'm hoping the data will be useful to folks.
These are specifically *not* intended to be kept around forever; we'll keep some reasonable number, 20-30 of them, and then start tossing old ones after that.
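As a rough illustration of that kind of retention policy, here is a small Python sketch. The function name dumps_to_delete is hypothetical, and it assumes each run is identified by a YYYYMMDD string, which may not match how the real directories are named.

```python
from typing import Iterable, List


def dumps_to_delete(dump_dates: Iterable[str], keep: int = 30) -> List[str]:
    """Return the run identifiers that fall outside the newest `keep` runs.

    Assumes identifiers are YYYYMMDD strings, so that plain string sorting
    is also chronological sorting.
    """
    ordered = sorted(dump_dates)
    if keep <= 0:
        return ordered
    # Everything except the last `keep` entries is old enough to toss.
    return ordered[:-keep]
```

With fewer than `keep` runs on disk the function returns an empty list, so nothing is removed until the limit is actually exceeded.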
A note about the timing of the dumps: they run once a day, and there's no progress reporting. An updated index file is published near the end of the day. Also, we dump content with a delay of 12 hours, to allow admins to delete things that might contain sensitive information. This was less of a concern for dumps generated once a week, but daily runs increase the odds of something bad getting dumped.
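One way to picture the 12-hour delay is as a cutoff on the upper end of the revision range. This Python sketch is only an illustration under that assumption: cutoff_rev_id and the (rev_id, timestamp) pairs are hypothetical and not taken from the real scripts.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# The 12-hour grace period mentioned above.
DELAY = timedelta(hours=12)


def cutoff_rev_id(revisions: Iterable[Tuple[int, datetime]],
                  now: datetime) -> Optional[int]:
    """Return the highest rev_id whose timestamp is at least DELAY old.

    `revisions` is an iterable of (rev_id, timestamp) pairs. Returns None
    if no revision is old enough to be dumped yet.
    """
    cutoff = now - DELAY
    eligible = [rev_id for rev_id, ts in revisions if ts <= cutoff]
    return max(eligible) if eligible else None
```

Anything newer than the cutoff simply waits for the next day's run, which gives admins that window to remove it first.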
And speaking of the index file, it's here: http://dumps.wikimedia.org/other/incr/
Guess I'll add some documentation on wikitech too. The code is in my branch in svn, see http://svn.wikimedia.org/viewvc/mediawiki/branches/ariel/xmldumps-backup/inc...
I may well be patching things tomorrow at this time for jobs that failed to run, so feel free to point out issues, but also don't be surprised by frequent outages.
Happy trails,
Ariel