A couple of days ago ariel, the database master for en.wikipedia, crashed briefly. The crash was not recorded in detail in the admin log, and I was out at the time, so I don't know exactly what happened.
Today, when applying a full PHP update and sync to the software, en.wikipedia suddenly began to display a read-only message about being locked due to a server crash. It was determined that the database configuration file had been edited, but not synchronized, during the earlier crisis. The read-only message was removed and the file resynchronized, reopening the wiki for editing.
Unfortunately, there was a combination of two other problems:
1) The config file also had ariel commented out, presumably to avoid error messages during the temporary crisis two days ago. As a result, the next server in the list was considered to be the master by the software.
2) The database slaves for en.wikipedia were misconfigured. All slaves *MUST* be kept in read_only mode or there is a high risk of data corruption in the case of wiki configuration errors.
They were *not* set in read_only mode, so db4, one of the slaves, ended up accepting edits for 40 minutes.
Further, edits went back onto ariel for several minutes while we were figuring out what had happened.
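To illustrate how the two problems combined, here is a minimal Python sketch. It is not MediaWiki's actual code: the server list, the `pick_master` and `try_write` helpers, and the `read_only` flags are hypothetical stand-ins, assuming only that the software treats the first listed server as the master.

```python
# Hypothetical model of the failure; names are illustrative, not MediaWiki's.

def pick_master(servers):
    """Assume the first server in the configured list is treated as master."""
    return servers[0]


def try_write(server, read_only_flags):
    """A server in read_only mode rejects the write; otherwise it accepts it."""
    if read_only_flags.get(server, False):
        raise PermissionError(f"{server} is read_only; write rejected")
    return f"edit written to {server}"


# Intended config: ariel is listed first, so ariel is the master.
intended = ["ariel", "db4"]

# Config during the incident: ariel commented out, so db4 moves to the front.
incident = [
    # "ariel",
    "db4",
]

# Slave misconfigured (read_only off): the stray write is silently accepted.
print(try_write(pick_master(incident), {"db4": False}))

# Slave correctly in read_only mode: the config error fails loudly instead.
try:
    try_write(pick_master(incident), {"db4": True})
except PermissionError as err:
    print("blocked:", err)
```

With read_only set on the slave, the same configuration mistake would have produced write errors rather than 40 minutes of edits on the wrong server.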
en.wikipedia is currently locked while we examine the databases to see whether recovery of the last 40 minutes' work is feasible.
-- brion vibber (brion @ pobox.com)
Tim tried running the slave de-sync script, but it was slow enough that it could take two or three hours to complete.
To get things back on their feet more quickly, we've bumped the revision ID position on the master so new edits won't interfere, and the recovery can continue in the background.
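The idea behind bumping the revision ID position can be sketched as follows. This is an illustration only: the `IdAllocator` class, the starting number, and the gap size are made up, and the real change was made on the master database rather than in application code.

```python
class IdAllocator:
    """Toy stand-in for an auto-increment revision ID counter."""

    def __init__(self, next_id):
        self.next_id = next_id

    def allocate(self):
        """Hand out the next ID and advance the counter."""
        rev_id = self.next_id
        self.next_id += 1
        return rev_id

    def bump(self, gap):
        """Skip ahead, leaving a block of unused IDs behind."""
        self.next_id += gap


master = IdAllocator(next_id=1000)    # where the master left off (made-up number)
stray_ids = list(range(1000, 1005))   # IDs the slave handed out in the meantime

master.bump(gap=100)                  # reserve room before reopening for edits
new_edit_id = master.allocate()       # new edits now start above the gap

print(new_edit_id)                    # 1100
print(new_edit_id in stray_ids)       # False
```

Because new edits are numbered above the reserved gap, the stray revisions can later be copied over with their original IDs while the wiki stays open.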
There were about 3500 affected edits; in theory most should be recoverable without significant conflicts. A few might possibly have conflicting id numbers and get scrapped.
en.wikipedia is back read-write. There may be some broken diffs and bad cached pages.
-- brion vibber (brion @ pobox.com)
Note that some affected pages currently show the wrong contents and are uneditable. This is a known problem and should be resolved within a couple of hours.
-- brion vibber (brion @ pobox.com)
All pages should now be editable. There might still be a few broken prev/next links in diff display, due to out-of-order rev_ids, but this is hard to fix and I'm inclined to think it's no big deal. Diff links from history should work fine.
I reorganised fixSlaveDesync.php so that it runs in minutes rather than hours, improved its concurrency handling, and then ran it to recover lost revisions. I also ran some queries to fix another category of broken page.
All other changes made during the 40 minutes in question, such as article and account creations, user blocks and page moves, will be lost.
There may still be some cached broken diffs and history pages, so remember to clear the cache with action=purge or a null edit before reporting any bugs.
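For anyone who wants to build the purge link by hand, here is a small Python sketch. The `purge_url` helper is hypothetical, and the base URL is an assumption about the usual index.php location; only the action=purge parameter comes from the note above.

```python
from urllib.parse import urlencode


def purge_url(title, base="https://en.wikipedia.org/w/index.php"):
    """Build a URL that asks the wiki to discard its cached copy of a page.

    The base URL is an assumed default; adjust it for the wiki in question.
    """
    return base + "?" + urlencode({"title": title, "action": "purge"})


print(purge_url("Main_Page"))
# https://en.wikipedia.org/w/index.php?title=Main_Page&action=purge
```

Visiting the resulting URL should regenerate the page from the database, after which any remaining problem is worth reporting.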
All this applies to en.wikipedia.org only.
-- Tim Starling
wikitech-l@lists.wikimedia.org