Hello folks,
So this has been running all of one day now, and I expect it to break in wild and crazy ways over the next period while we get the bugs out. But, throwing caution to the winds...
I'm generating dumps each day for each non-closed non-private project, of revisions added since the previous day. It uses the standard xml format, writing out stubs and history files.
This is a sort of poor person's incremental dump. What do I mean by that? Well... It doesn't contain a list of deletions, page moves, undeletes. It just dumps the metadata and text for every revision between X1 (last revision dumped the day before) and X2 (last revision in db as of the time it's dumped). The reason for that? Dumping a range of revisions is relatively easy. Accounting for page deletions, moves etc. since the previous dump is hard, so that is an exercise left to the reader :-P
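For anyone who wants to poke at the stubs files, here is a minimal sketch of walking the revisions in one of them. It assumes the usual export layout (`<page>` with `<title>` and `<id>`, followed by `<revision>` elements carrying `<id>` and `<timestamp>`); the function names, the sample data and the namespace URI in the sample are made up for illustration and are not taken from the dump code.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(stream):
    """Yield (page_id, title, rev_id, timestamp) for each revision in a stubs file."""
    title = page_id = None
    path = []  # local tag names of the elements that are currently open
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        name = local(elem.tag)
        if event == "start":
            path.append(name)
            continue
        path.pop()
        parent = path[-1] if path else None
        if name == "title" and parent == "page":
            title = elem.text
        elif name == "id" and parent == "page":
            page_id = int(elem.text)
        elif name == "revision":
            # Only direct children are read, so a contributor's <id> is ignored.
            fields = {local(child.tag): child.text for child in elem}
            yield page_id, title, int(fields["id"]), fields["timestamp"]
        elif name == "page":
            elem.clear()  # free memory once a page has been fully handled


# Hypothetical sample input, not real dump content.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/">
  <page>
    <title>Example</title>
    <id>7</id>
    <revision>
      <id>101</id>
      <timestamp>2011-11-23T10:00:00Z</timestamp>
      <contributor><username>A</username><id>3</id></contributor>
    </revision>
    <revision>
      <id>105</id>
      <timestamp>2011-11-23T11:30:00Z</timestamp>
    </revision>
  </page>
</mediawiki>"""

for row in iter_revisions(io.BytesIO(SAMPLE)):
    print(row)
```

Since the files carry no deletions or moves, a consumer along these lines only ever learns about new revisions; reconciling against deleted or renamed pages still has to happen some other way.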
Even with these limitations I'm hoping the data will be useful to folks.
These are specifically *not* intended to be kept around forever; we'll keep some reasonable number, 20-30 of them, and then start tossing old ones after that.
A note about the timing of the dumps: they run once a day, and there's no progress reporting. An updated index file is published near the end of the day. Also, we dump content with a delay of 12 hours, to allow admins to delete things that might contain sensitive information. This was less of a concern for dumps generated once a week, but daily runs increase the odds of something bad getting dumped.
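To make the 12-hour delay concrete, here is a rough sketch of one way an upper bound for the revision range could be picked. This is an assumption for illustration only: how the actual dump scripts choose the last revision is not described here, and `last_dumpable_rev` and its inputs are hypothetical.

```python
from datetime import datetime, timedelta, timezone

DELAY = timedelta(hours=12)  # the delay described above


def last_dumpable_rev(revisions, now, delay=DELAY):
    """Return the highest rev id such that it and every lower id are at least `delay` old.

    `revisions` is an iterable of (rev_id, timestamp) pairs with
    timezone-aware timestamps. Returns None if nothing qualifies.
    """
    cutoff = now - delay
    last = None
    for rev_id, ts in sorted(revisions):
        if ts > cutoff:
            break  # stop at the first revision that is still too fresh
        last = rev_id
    return last


# Hypothetical data: three revisions and a run at noon UTC.
utc = timezone.utc
revs = [
    (101, datetime(2011, 11, 23, 22, 0, tzinfo=utc)),
    (102, datetime(2011, 11, 24, 0, 0, tzinfo=utc)),
    (103, datetime(2011, 11, 24, 3, 0, tzinfo=utc)),
]
print(last_dumpable_rev(revs, now=datetime(2011, 11, 24, 12, 0, tzinfo=utc)))
```

Stopping at the first too-recent revision, rather than taking the maximum eligible id, keeps the dumped range contiguous even if ids and timestamps are not perfectly in step.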
And speaking of the index file, it's here: http://dumps.wikimedia.org/other/incr/
Guess I'll add some documentation on wikitech too. The code is in my branch in svn, see http://svn.wikimedia.org/viewvc/mediawiki/branches/ariel/xmldumps-backup/inc...
I may well be patching things tomorrow at this time for jobs that failed to run, so feel free to point out issues, but also don't be surprised by frequent outages.
Happy trails,
Ariel
Hi Ariel,
2011/11/24 Ariel T. Glenn ariel@wikimedia.org:
I'm generating dumps each day for each non-closed non-private project, of revisions added since the previous day. It uses the standard xml format, writing out stubs and history files.
[…]
Even with these limitations I'm hoping the data will be useful to folks.
That's great; no doubt I'll use it soon for both maintenance projects and copyvio detection.
Thanks!
Hi Ariel,
This is great news and a first but big step towards incremental backups :)
Diederik
On 2011-11-24, at 5:21 AM, Jérémie Roquet wrote:
Hi Ariel,
2011/11/24 Ariel T. Glenn ariel@wikimedia.org:
I'm generating dumps each day for each non-closed non-private project, of revisions added since the previous day. It uses the standard xml format, writing out stubs and history files.
[…]
Even with these limitations I'm hoping the data will be useful to folks.
That's great; no doubt I'll use it soon for both maintenance projects and copyvio detection.
Thanks!
-- Jérémie
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
On 24/11/11 11:00, Ariel T. Glenn wrote:
Hello folks,
So this has been running all of one day now, and I expect it to break in wild and crazy ways over the next period while we get the bugs out. But, throwing caution to the winds...
I'm generating dumps each day for each non-closed non-private project, of revisions added since the previous day. It uses the standard xml format, writing out stubs and history files.
This is a sort of poor person's incremental dump. What do I mean by that? Well... It doesn't contain a list of deletions, page moves, undeletes. It just dumps the metadata and text for every revision between X1 (last revision dumped the day before) and X2 (last revision in db as of the time it's dumped). The reason for that? Dumping a range of revisions is relatively easy. Accounting for page deletions, moves etc. since the previous dump is hard, so that is an exercise left to the reader :-P
Even with these limitations I'm hoping the data will be useful to folks.
Looks good. I think this can be very useful for the toolserver, where we have all the db metadata except the text. I think I'll make up something. (downloading now)
I may well be patching things tomorrow at this time for jobs that failed to run, so feel free to point out issues, but also don't be surprised by frequent outages.
The md5sum file lacks the filenames. :)
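For reference, `md5sum -c` expects each line to pair the digest with a filename, separated by two spaces. A minimal sketch of writing such a file follows; the helper names are hypothetical and this is not how the dump scripts do it.

```python
import hashlib
from pathlib import Path


def md5_line(path, chunk_size=1 << 20):
    """Return one 'digest  filename' line (two spaces) for the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large dump files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return f"{digest.hexdigest()}  {Path(path).name}"


def write_md5sums(paths, out_path):
    """Write one line per file, in the layout that `md5sum -c` can verify."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(md5_line(path) + "\n")
```

With the filenames present, running `md5sum -c` on the resulting file from the download directory can check every dump in one go.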
The code is in my branch in svn, see http://svn.wikimedia.org/viewvc/mediawiki/branches/ariel/xmldumps-backup/inc...
The .dblist files in the branch are dummies. What are the checkforbz2footer and writeuptopageid tools? (They seem unused, btw.)