en.wikipedia database broken a bit - Wikitech-l

21 Jul 2006


A couple of days ago Ariel, the database master for en.wikipedia, had a brief
crash of some kind. The incident was not logged in detail in the admin log, and
I was out at the time, so I don't know exactly what happened.

Today, while a full PHP update and sync was being applied to the software,
en.wikipedia suddenly began displaying a read-only message saying it was locked
due to a server crash. It turned out that the database configuration file had
been edited during the earlier crisis but never synchronized. The read-only
message was removed and the file resynchronized, reopening the wiki for
editing.

Unfortunately, two other problems combined to make things worse:
1) The config file also had ariel commented out, presumably to avoid error
messages during the temporary crisis two days ago. As a result, the software
treated the next server in the list as the master.
2) The database slaves for en.wikipedia were misconfigured. All slaves *MUST* be
kept in read_only mode; otherwise a wiki configuration error carries a high
risk of data corruption.
They were *not* in read_only mode, so db4, one of the slaves, ended up
accepting edits for 40 minutes.
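To make the interaction of the two problems concrete, here is a toy sketch in
Python. It assumes a simplified rule where the first server in the list that is
not commented out is treated as the master, and it models read_only as a flag
that makes a server refuse client writes (loosely mirroring MySQL's `read_only`
server variable). The classes, functions and the slave names `slave1` and
`slave2` are hypothetical; this is not the actual MediaWiki load-balancer code.

```python
class ReadOnlyError(Exception):
    """Raised when a write reaches a server running in read-only mode."""


class ToyServer:
    """Hypothetical stand-in for a database server; not a real MySQL client."""

    def __init__(self, host, read_only=False, commented_out=False):
        self.host = host
        self.read_only = read_only
        self.commented_out = commented_out
        self.rows = []

    def write(self, row):
        # A server in read-only mode refuses writes from ordinary clients.
        if self.read_only:
            raise ReadOnlyError(f"{self.host} is read-only")
        self.rows.append(row)


def pick_master(servers):
    """Assumed rule: the first entry not commented out is the master."""
    for server in servers:
        if not server.commented_out:
            return server
    raise RuntimeError("no database servers configured")


def save_edit(servers, row):
    # Route the edit to whichever server the config says is the master.
    master = pick_master(servers)
    try:
        master.write(row)
        return f"accepted by {master.host}"
    except ReadOnlyError:
        return f"rejected by {master.host}"


def config(slaves_read_only):
    # ariel is commented out, as it was left after the earlier crash.
    return [
        ToyServer("ariel", commented_out=True),
        ToyServer("slave1", read_only=slaves_read_only),  # hypothetical name
        ToyServer("slave2", read_only=slaves_read_only),  # hypothetical name
    ]


print(save_edit(config(slaves_read_only=False), "edit"))  # accepted by slave1
print(save_edit(config(slaves_read_only=True), "edit"))   # rejected by slave1
```

With read_only set on the slaves, the misrouted edit fails loudly instead of
silently landing on a slave and diverging from the master's data.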

In addition, edits went back to ariel for several minutes while we were
figuring out what had happened.

en.wikipedia is currently locked while we examine the databases to see whether
recovery of the last 40 minutes' work is feasible.
-- brion vibber (brion @ pobox.com)