There have been some questions recently about public backup dumps (or the lack thereof). I've spent the last few days getting the dump generation infrastructure up and running in a more consistent, effective fashion.
Here's what's currently on my plate and the status thereof:
* Title corrections: some of the databases contain invalid page titles left over from old bugs. This can sometimes break the import or export process, so I'm writing a fixup script to find and rename them.
STATUS: Finding done, fixing to come. Should be done with this later today.
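For a rough idea of what the "finding" half looks like, here is a minimal sketch in Python. It is not the actual fixup script: the rule set and the names `title_problems` and `find_bad_titles` are assumptions made up for illustration, and the real checks may differ.

```python
import re

# Illustrative rules only (assumed for this sketch): characters that are
# not allowed in page titles, plus a few shape checks on stored titles.
FORBIDDEN = re.compile(r"[#<>\[\]|{}]")

def title_problems(title):
    """Return a list of reasons why a stored title looks invalid."""
    problems = []
    if not title:
        problems.append("empty")
    if FORBIDDEN.search(title):
        problems.append("forbidden character")
    if " " in title:
        problems.append("space instead of underscore")
    if title != title.strip("_"):
        problems.append("leading or trailing underscore")
    return problems

def find_bad_titles(titles):
    """Map each suspicious title to the list of problems found in it."""
    bad = {}
    for title in titles:
        problems = title_problems(title)
        if problems:
            bad[title] = problems
    return bad
```

Keeping the finding step separate from the renaming step means the list of hits can be reviewed before anything gets changed.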
* Dump filtering/processing: currently the dump has to run twice to produce the current-only and all-revisions dump files. I'm working on a postprocessing tool which will be able to produce both from a single run, as well as a filtered dump with the talk and user pages removed.
Producing the split versions from one run should also mean that the dump can run without having replication stopped the whole time.
It can also produce SQL for importing a dump directly into a database in either 1.4 or 1.5 schema, for those using software based on the old database layout. (We probably won't be hosting such files on our server but you can run the program locally to filter XML-to-MySQL.)
STATUS: Mostly done. Some more testing and actually hooking up multiple simultaneous outputs remains. Should be done tonight or tomorrow.
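The single-pass idea can be sketched roughly as follows, in Python. This is not the actual tool: the function name `split_dump`, the title-prefix filter, and the simplified XML (no XML namespace, only `title`, `timestamp` and `text` elements) are all assumptions for illustration, and the filtered output is assumed here to be current-only.

```python
import io
import xml.etree.ElementTree as ET

# Assumed for this sketch: talk and user pages are recognised by title prefix.
EXCLUDED_PREFIXES = ("Talk:", "User:", "User talk:")

def split_dump(source):
    """Read the dump once and fan each page out to three outputs."""
    full, current, filtered = [], [], []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "page":
            continue
        title = elem.findtext("title")
        revisions = [(r.findtext("timestamp"), r.findtext("text"))
                     for r in elem.findall("revision")]
        # ISO 8601 timestamps in one timezone sort correctly as strings.
        latest = max(revisions, key=lambda rev: rev[0])
        full.append((title, revisions))           # all-revisions output
        current.append((title, [latest]))         # current-only output
        if not title.startswith(EXCLUDED_PREFIXES):
            filtered.append((title, [latest]))    # no talk/user pages
        elem.clear()  # drop the page's children so memory stays bounded
    return full, current, filtered

SAMPLE = b"""<mediawiki>
  <page><title>Apple</title>
    <revision><timestamp>2005-01-01T00:00:00Z</timestamp><text>old</text></revision>
    <revision><timestamp>2005-06-01T00:00:00Z</timestamp><text>new</text></revision>
  </page>
  <page><title>Talk:Apple</title>
    <revision><timestamp>2005-02-01T00:00:00Z</timestamp><text>chat</text></revision>
  </page>
</mediawiki>"""

full, current, filtered = split_dump(io.BytesIO(SAMPLE))
```

A real version would write each output to its own file as it goes rather than collecting lists, but the shape is the same: one read of the source, several writers.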
* Progress and error reporting: The old backup script was a hacky shell script with no error detection or recovery, and it required manually stopping replication on a database server and reconfiguring the wiki cluster for the duration. If something went awry, maybe nobody noticed... the hackiness of this is a large part of why we've never just let it run automatically on a cronjob.
I want to rework this for better automation and to provide useful indications of what it's doing, where it's up to, and if something went wrong.
STATUS: Not yet started. Hope to have it done tomorrow or Friday.
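Since the runner isn't written yet, the following is only a sketch, in Python, of the kind of behaviour described: run each step, record where it's up to, and stop with a recorded error if something fails. The name `run_steps`, the JSON layout, and the status file itself are hypothetical.

```python
import json
import time
from pathlib import Path

def run_steps(steps, status_path):
    """Run (name, callable) steps in order, saving progress after each change."""
    status = {"state": "running", "steps": []}

    def save():
        # Rewrite the whole status file so readers always see current state.
        Path(status_path).write_text(json.dumps(status, indent=2))

    save()
    for name, step in steps:
        entry = {"name": name, "state": "running", "started": time.time()}
        status["steps"].append(entry)
        save()
        try:
            step()
        except Exception as exc:
            # Record the failure and stop; later steps are not attempted.
            entry["state"] = "failed"
            entry["error"] = str(exc)
            status["state"] = "failed"
            save()
            return status
        entry["state"] = "done"
        save()
    status["state"] = "done"
    save()
    return status
```

A status file like this is something a download page could read to show whether a dump is in progress, finished, or broken.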
* Clean up download.wikimedia.org further, and make use of the status files left by the updated backup runner script.
STATUS: Not yet started. (Doesn't have to be up before the backup starts.)
-- brion vibber (brion @ pobox.com)