Dump progress - Wikitech-l

31 Aug 2005


      There's been some questions recently about public backup dumps (or the
lack thereof). I've been working for the last few days on getting the
dump generation infrastructure up and running in a more consistent,
effective fashion.
Here's what's currently on my plate and the status thereof:
* Title corrections: some of the databases contain invalid page titles
left over from old bugs. This can sometimes break the import or export
process, so I'm writing a fixup script to find and rename them.
STATUS: Finding done, fixing to come.
Should be done with this later today.
* Dump filtering/processing: currently the dump has to run twice to
produce the current-only and all-revisions dump files. I'm working on a
postprocessing tool which will be able to split these two from a single
runthrough, as well as produce a filtered dump with the talk and user
pages removed.
Producing the split versions from one run should also mean that the dump
can run without having replication stopped the whole time.
It can also produce SQL for importing a dump directly into a database in
either 1.4 or 1.5 schema, for those using software based on the old
database layout. (We probably won't be hosting such files on our server
but you can run the program locally to filter XML-to-MySQL.)
STATUS: Mostly done. Some more testing and actually hooking up multiple
simultaneous outputs remains.
Should be done tonight or tomorrow.
* Progress and error reporting: The old backup script was a hacky shell
script with no error detection or recovery, requireing manually stopping
replication on a database server and reconfiguring the wiki cluster for
the duration. If something went awry, maybe nobody noticed... the
hackiness of this is a large part of why we've never just let it run
automatically on a cronjob.
I want to rework this for better automation and to provide useful
indications of what it's doing, where it's up to, and if something went
wrong.
STATUS: Not yet started. Hope to have done tomorrow or Friday.
* Clean up download.wikimedia.org further, make use of status files left
by the updated backup runner script.
STATUS: Not yet started. (Doesn't have to be up before the backup starts.)
-- brion vibber (brion @ pobox.com)