There have been some questions recently about public backup dumps (or the lack thereof). I've been working for the last few days on getting the dump generation infrastructure up and running in a more consistent, effective fashion.
Here's what's currently on my plate and the status thereof:
* Title corrections: some of the databases contain invalid page titles left over from old bugs. These can sometimes break the import or export process, so I'm writing a fixup script to find and rename them (rough sketch at the end of this list).
STATUS: Finding done, fixing to come. Should be done with this later today.
* Dump filtering/processing: currently the dump has to run twice to produce the current-only and all-revisions dump files. I'm working on a postprocessing tool that will be able to split these two from a single run-through, as well as produce a filtered dump with the talk and user pages removed (again, see the sketch below).
Producing the split versions from one run should also mean that the dump can run without having replication stopped the whole time.
It can also produce SQL for importing a dump directly into a database in either the 1.4 or 1.5 schema, for those using software based on the old database layout. (We probably won't be hosting such files on our server, but you can run the program locally to do the XML-to-MySQL conversion.)
STATUS: Mostly done. Some more testing and actually hooking up multiple simultaneous outputs remain. Should be done tonight or tomorrow.
* Progress and error reporting: The old backup script was a hacky shell script with no error detection or recovery, requiring manually stopping replication on a database server and reconfiguring the wiki cluster for the duration. If something went awry, maybe nobody noticed; the hackiness of this setup is a large part of why we've never just let it run automatically from a cron job.
I want to rework this for better automation and to provide useful indications of what it's doing, how far along it is, and whether something went wrong.
STATUS: Not yet started. Hope to have this done tomorrow or Friday.
* Clean up download.wikimedia.org further and make use of the status files left by the updated backup runner script.
STATUS: Not yet started. (Doesn't have to be up before the backup starts.)
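
To make the title-corrections item above a bit more concrete, here's roughly the shape of the check involved. This is only a sketch: the character class and the underscore rename rule below are illustrative assumptions, not the actual fixup script.

import re

# Approximation of characters MediaWiki won't accept in page titles;
# the real rules are more involved (this set is an assumption for illustration).
BAD_CHARS = re.compile(r'[\x00-\x1f#<>\[\]|{}]')

def find_bad_titles(titles):
    """Yield (old, suggested) pairs for titles containing illegal characters."""
    for title in titles:
        if BAD_CHARS.search(title):
            # Crude suggestion: swap each illegal character for an underscore,
            # keeping the original around so a human can review the rename.
            yield title, BAD_CHARS.sub('_', title)

if __name__ == '__main__':
    for old, new in find_bad_titles(['Main_Page', 'Broken#Title', 'Odd[Page]']):
        print('%s -> %s' % (old, new))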
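
For the dump filtering/processing item, this is the general idea of the single-pass splitter: read the full-history XML stream once and write the all-revisions, current-only, and public (no talk or user pages) outputs at the same time. The element names, the namespace test, and the missing <mediawiki>/<siteinfo> wrapper are all simplifications; the real tool also has to handle compression and the SQL output modes mentioned above.

import xml.etree.ElementTree as ET

# Namespaces left out of the "public" dump in this sketch (assumption).
SKIP_PREFIXES = ('Talk:', 'User:', 'User talk:')

def split_dump(infile, full_out, current_out, public_out):
    # Output files are expected to be opened in binary mode.
    for event, elem in ET.iterparse(infile, events=('end',)):
        if elem.tag != 'page':      # real dumps qualify tags with an XML namespace
            continue
        title = elem.findtext('title') or ''
        full_out.write(ET.tostring(elem))       # every revision of every page

        # Current-only: drop all but the last revision of this page.
        for rev in elem.findall('revision')[:-1]:
            elem.remove(rev)
        current_xml = ET.tostring(elem)
        current_out.write(current_xml)

        # Public: current revisions minus talk and user pages.
        if not title.startswith(SKIP_PREFIXES):
            public_out.write(current_xml)

        elem.clear()    # keep memory bounded while streaming a huge dump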
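
For the progress/error-reporting item, the behaviour I'm after is along these lines: run each step, log what's happening, and leave a machine-readable status file behind so a failure gets noticed instead of silently producing a broken dump. The step layout and status-file format here are made up for illustration, not a commitment to what the real runner will do.

import subprocess
import time

def run_step(name, command, statusfile):
    """Run one dump step, recording start/failure/completion in a status file."""
    def mark(state):
        with open(statusfile, 'w') as f:
            f.write('%s %s: %s\n' % (time.strftime('%Y-%m-%d %H:%M:%S'), name, state))

    mark('started')
    try:
        subprocess.check_call(command)
    except subprocess.CalledProcessError as e:
        mark('failed (exit %d)' % e.returncode)
        raise   # stop the run rather than carrying on with bad data
    mark('done')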
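
And for the download.wikimedia.org cleanup, one obvious use of those status files (again only a sketch, with an assumed file layout) is to render a simple index page so people can see at a glance which dumps are current and which runs failed:

import glob
import os

def build_index(statusdir, outfile):
    """Collect each wiki's latest status line and write a bare-bones HTML table."""
    rows = []
    for path in sorted(glob.glob(os.path.join(statusdir, '*', 'status.txt'))):
        wiki = os.path.basename(os.path.dirname(path))
        status = open(path).read().strip()
        rows.append('<tr><td>%s</td><td>%s</td></tr>' % (wiki, status))
    with open(outfile, 'w') as f:
        f.write('<html><body><table>\n%s\n</table></body></html>\n' % '\n'.join(rows))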
-- brion vibber (brion @ pobox.com)
On 8/31/05, Brion Vibber <brion@pobox.com> wrote:
> There have been some questions recently about public backup dumps (or the lack thereof). I've been working for the last few days on getting the dump generation infrastructure up and running in a more consistent, effective fashion.
Should contributions to peripheral things like this be limited to PHP? If I had a Python contribution, would it be used?
Also, on the PostgreSQL schema: is this being worked on, or would it be useful for me to work on it? I'm doing a project using WP data on a psql db.
Jeremy Dunck wrote:
> On 8/31/05, Brion Vibber <brion@pobox.com> wrote:
> > There have been some questions recently about public backup dumps (or the lack thereof). I've been working for the last few days on getting the dump generation infrastructure up and running in a more consistent, effective fashion.
> Should contributions to peripheral things like this be limited to PHP? If I had a Python contribution, would it be used?
Various supplementary things are in Python, C, C++, Java, and C#. And of course our TeX processor is in Objective CAML. ;)
> Also, on the PostgreSQL schema: is this being worked on, or would it be useful for me to work on it? I'm doing a project using WP data on a psql db.
Domas had ported MediaWiki 1.4 to PostgreSQL, minus some niceties like the installer. It hasn't been maintained, however, and the schema files are almost certainly not updated for 1.5.
If you'd like to help get this up to speed in CVS, that would be cool; Kate has added some bits for supporting Oracle as well, and there may or may not be a little more infrastructure now for supporting multiple backends with the installation and updaters.
-- brion vibber (brion @ pobox.com)
I wrote a couple days ago:
> - Progress and error reporting: The old backup script was a hacky shell script with no error detection or recovery, requiring manually stopping replication on a database server and reconfiguring the wiki cluster for the duration. If something went awry, maybe nobody noticed; the hackiness of this setup is a large part of why we've never just let it run automatically from a cron job.
> I want to rework this for better automation and to provide useful indications of what it's doing, how far along it is, and whether something went wrong.
> STATUS: Not yet started. Hope to have this done tomorrow or Friday.
A semi-experimental backup run is in progress now.
Try for instance: http://download.wikimedia.org/special/sources/
For your amusement there's a log for each wiki: http://download.wikimedia.org/special/sources/backup.log
In addition to the full and all-current sets, there's now also a page dump which excludes user pages and talk pages: http://download.wikimedia.org/special/sources/pages_public.xml.gz
And uploaded files should be included again: http://download.wikimedia.org/special/sources/upload.tar
It's entirely possible that there are still horrible problems and we'll have to run it over, of course. :)
> - Clean up download.wikimedia.org further and make use of the status files left by the updated backup runner script.
> STATUS: Not yet started. (Doesn't have to be up before the backup starts.)
Haven't gotten to this yet; I'll try to get to it by the end of the weekend if no one else tries their hand first.
-- brion vibber (brion @ pobox.com)