So this has been running all of one day now, and I expect it to break in
wild and crazy ways over the next little while as we get the bugs out.
But, throwing caution to the winds...
I'm generating daily dumps, for each non-closed, non-private project,
of the revisions added since the previous day. They use the standard XML
format and are written out as stubs and history files.
This is a sort of poor person's incremental dump. What do I mean by
that? Well... It doesn't contain a list of deletions, page moves, or
undeletes. It just dumps the metadata and text for every revision
between X1 (last revision dumped the day before) and X2 (last revision
in db as of the time it's dumped). The reason for that? Dumping a
range of revisions is relatively easy. Accounting for page deletions,
moves etc. since the previous dump is hard, so that is an exercise left
to the reader :-P
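
To make the X1/X2 idea concrete, here is a rough sketch in Python of how
the revision range for one day's run could be picked. This is not the
actual dump code: the function name, and the choice to treat X1 as
already dumped (so the run starts at X1 + 1), are assumptions made for
illustration.

```python
from typing import Optional, Tuple


def revision_range(last_dumped: int, max_in_db: int) -> Optional[Tuple[int, int]]:
    """Return the (first, last) revision IDs to dump, both inclusive.

    last_dumped -- X1, the last revision ID written by the previous run
    max_in_db   -- X2, the highest revision ID in the db at dump time

    Returns None when no new revisions have been added since the last run.
    Assumption: X1 itself was already dumped, so we start at X1 + 1.
    """
    if max_in_db <= last_dumped:
        return None
    return (last_dumped + 1, max_in_db)


# Hypothetical numbers, purely for illustration.
print(revision_range(1000, 1250))
print(revision_range(1250, 1250))
```

Note that nothing in this range tells you whether a page was deleted or
moved in the meantime, which is exactly the limitation described above.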
Even with these limitations I'm hoping the data will be useful to folks.
These are specifically *not* intended to be kept around forever; we'll
keep some reasonable number, 20-30 of them, and then start tossing old
ones after that.
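
The cleanup could look something like the sketch below. It assumes each
day's dump is identified by a YYYYMMDD string, which is my assumption
and not necessarily how the real directories are named; the function
name and the default of 30 are likewise just for illustration.

```python
from typing import Iterable, List


def dumps_to_delete(dump_dates: Iterable[str], keep: int = 30) -> List[str]:
    """Return the dump dates that fall outside the newest `keep` runs.

    Assumes dates are YYYYMMDD strings, so that string order matches
    chronological order. The result is sorted oldest first.
    """
    newest_first = sorted(dump_dates, reverse=True)
    return sorted(newest_first[keep:])


# Hypothetical dates: with keep=2, only the oldest run is tossed.
print(dumps_to_delete(["20111101", "20111102", "20111103"], keep=2))
```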
A note about the timing of the dumps: they run once a day, and there's
no progress reporting. An updated index file is published near the end of
the day. Also, we dump content with a delay of 12 hours, to allow
admins to delete things that might contain sensitive information. This
was less of a concern for dumps generated once a week, but daily runs
increase the odds of something bad getting dumped.
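
One way to picture the 12-hour delay is as a cutoff timestamp: anything
newer than "now minus 12 hours" waits for a later run. The sketch below
is only an illustration of that idea, with made-up function names; the
real job may apply the delay differently.

```python
from datetime import datetime, timedelta, timezone

# The 12-hour delay mentioned above.
DUMP_DELAY = timedelta(hours=12)


def dump_cutoff(now: datetime) -> datetime:
    """Latest revision timestamp eligible for dumping at time `now`."""
    return now - DUMP_DELAY


def is_dumpable(rev_timestamp: datetime, now: datetime) -> bool:
    """True if the revision is at least 12 hours old at time `now`."""
    return rev_timestamp <= dump_cutoff(now)


# Hypothetical run time, for illustration only.
run_time = datetime(2011, 11, 25, 18, 0, tzinfo=timezone.utc)
print(dump_cutoff(run_time).isoformat())
print(is_dumpable(datetime(2011, 11, 25, 5, 0, tzinfo=timezone.utc), run_time))
print(is_dumpable(datetime(2011, 11, 25, 9, 0, tzinfo=timezone.utc), run_time))
```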
And speaking of the index file, it's here:
Guess I'll add some documentation on wikitech too. The code is in my
branch in svn, see
I may well be patching things tomorrow at this time for jobs that failed
to run, so feel free to point out issues, but also don't be surprised by