Dan Collins wrote:
Apparently, for whatever reason, the master database server for enwiki got overloaded. This was following a few updates, which may have (I don't think they're sure yet) caused the problem. What actually happened was the database server had a large number of queries stuck in the 'statistics' status, leading to overload, leading to wiki down. Enwiki was set to read only, and the Almighty Tim, Patron Saint of Master Databases, arrived on the scene to heroically run the master-database switch script. The S1 (enwiki) master database was changed from db14 to db16, and db14 was removed from the slave rotation. From what I understand, db14 will need a swift kick to the power button to make it all jolly and happy again.
Ah excellent, you did my summary post for me. :)
Lots of threads sitting in the "statistics" state seems to be MySQL's way of saying "I've fallen and I can't get up". It's unclear exactly what set it off, but basically nothing works well until you restart the server.
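As a rough illustration of spotting that condition, here is a minimal Python sketch that tallies thread states from rows shaped like the output of SHOW PROCESSLIST (the "State" and "Time" columns are MySQL's own). The function names and the thresholds (50 threads, 10 seconds) are assumptions made up for this example, not anything the site actually runs, and fetching the rows from the server is left out.

```python
from collections import Counter


def _state(row):
    # Normalise the State column; MySQL may report NULL for idle threads.
    return (row.get("State") or "").strip().lower()


def count_thread_states(processlist_rows):
    """Tally the State column across rows shaped like SHOW PROCESSLIST output."""
    return Counter(_state(row) for row in processlist_rows)


def looks_stuck(processlist_rows, state="statistics", min_threads=50, min_seconds=10):
    """Return True when many threads have lingered in the given state.

    min_threads and min_seconds are illustrative guesses, not tuned values.
    """
    lingering = [
        row for row in processlist_rows
        if _state(row) == state and int(row.get("Time") or 0) >= min_seconds
    ]
    return len(lingering) >= min_threads


if __name__ == "__main__":
    # Hypothetical sample data standing in for real SHOW PROCESSLIST rows.
    sample = [{"Id": i, "State": "statistics", "Time": 42} for i in range(60)]
    sample.append({"Id": 999, "State": None, "Time": 0})
    print(count_thread_states(sample).most_common(3))
    print("stuck" if looks_stuck(sample) else "ok")
```

A check like this only flags the symptom; as noted above, the fix in this case was still a restart and a master switch.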
At 52 minutes from the start of the event, this took us a bit longer to resolve than I'd like -- we had to percolate through a couple of levels of alert calls. (Sorry to wake you up early, Tim!)
A similar event in the future should be fixable within a few minutes, thanks to Tim's work on making the master-switch system more foolproof. We're fixing up our internal documentation so that all our site ops will know how to run the database master switch script next time!
Only en.wikipedia.org was affected, other than a couple of minutes where we threw the whole site to read-only while figuring out what was going on.
-- brion