On 24/11/11 11:00, Ariel T. Glenn wrote:
Hello folks,
So this has been running all of one day now, and I expect it to break in wild and crazy ways over the next period while we get the bugs out. But, throwing caution to the winds...
I'm generating dumps each day for each non-closed non-private project, of revisions added since the previous day. It uses the standard xml format, writing out stubs and history files.
This is a sort of poor person's incremental dump. What do I mean by that? Well... It doesn't contain a list of deletions, page moves, undeletes. It just dumps the metadata and text for every revision between X1 (last revision dumped the day before) and X2 (last revision in db as of the time it's dumped). The reason for that? Dumping a range of revisions is relatively easy. Accounting for page deletions, moves etc. since the previous dump is hard, so that is an exercise left to the reader :-P
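To illustrate the range selection described above, here is a minimal sketch and not the actual dump code. The function name and parameters are hypothetical, and it assumes that "between X1 and X2" means every revision id after X1 up to and including X2.

```python
def revision_range(last_dumped_rev_id, max_rev_id_in_db):
    """Return the inclusive (start, end) revision ids to dump, or None if nothing is new.

    last_dumped_rev_id: X1, the last revision written by the previous day's run.
    max_rev_id_in_db:   X2, the highest revision id in the db at dump time.
    Both names are assumptions made for this illustration.
    """
    start = last_dumped_rev_id + 1
    if start > max_rev_id_in_db:
        # No revisions were added since the previous run.
        return None
    return (start, max_rev_id_in_db)
```

Note that a range like this only captures added revisions, which is why deletions, moves and undeletes do not show up in the output.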
Even with these limitations I'm hoping the data will be useful to folks.
Looks good. I think this can be very useful for the toolserver, where we have all the db metadata except the text. I think I'll put something together. (downloading now)
I may well be patching things tomorrow at this time for jobs that failed to run, so feel free to point out issues, but also don't be surprised by frequent outages.
The md5sum file lacks the filenames. :)
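For reference, the md5sum tool normally writes one line per file: the hex digest, two spaces, then the filename. A minimal sketch of producing such a line follows; the helper name is hypothetical and this is not taken from the dump scripts.

```python
import hashlib
import os


def md5sum_line(path, chunk_size=1024 * 1024):
    """Return a line in md5sum's text-mode format: '<hex digest>  <filename>'."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large dump files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return f"{digest.hexdigest()}  {os.path.basename(path)}"
```

With the filename on each line, the file can be checked directly with `md5sum -c`.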
The code is in my branch in svn, see http://svn.wikimedia.org/viewvc/mediawiki/branches/ariel/xmldumps-backup/inc...
The .dblist files in the branch are dummies. What are the checkforbz2footer and writeuptopageid tools? (They seem to be unused, btw)