Recovery consisted of four phases:
1) "What the hell is going on?" Outage mentioned immediately by users in IRC. On investigation, the whole of PowerMedium appeared to be offline. Mark indicated that they had a major network problem. I phoned their support line; they confirmed a big network problem and said they were bringing in Charles to work on it. (This was about 3:30pm Sunday afternoon Florida time).
At this point there was nothing further we could do; we had to wait for them to fix things on their end. I also called Kyle so we'd have a pair of hands in the office when things started to come back online.
2) "Why does nothing work?"
After an hour or so they apparently had their general issues under control. PowerMedium's own web site came back up, we could get at our own switch over the network, and Charles (bw) was available online.
Between us remote folks and bw & Kyle on-site, we did some banging on rocks. We found that in addition to the network outage there had been a power problem (presumably this is what killed their routers too), which had rebooted everything.
At this stage we were confronted with the fragility of the internal DNS and LDAP we had set up to make everything work. While we'd expended some effort to minimize the dependencies on NFS, we hadn't yet put similar effort into these services. Until they were restored, booting was a vveeerrryyyy slow proposition (with lots of timeout steps), and it took another hour or so to get key infrastructure back in place to where we could seriously get working.
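We haven't settled on a fix yet, but the kind of tightening I have in mind looks roughly like this. A sketch only, assuming the stock glibc resolver and nss_ldap; the timeout values are illustrative, not tested settings:

  # Make lookups fail over quickly instead of hanging through the long
  # default timeouts while the DNS/LDAP boxes are still down.
  echo 'options timeout:2 attempts:1' >> /etc/resolv.conf

  # nss_ldap: don't block boot/logins waiting on an unreachable LDAP server
  printf 'bind_policy soft\nbind_timelimit 5\n' >> /etc/ldap.conf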
3) "Where's my data?"
MediaWiki is highly reliant on its database backend. With machines up, we were able to start the MySQL databases, which began running InnoDB transaction recovery. This took much longer than expected, apparently because we have a *huge* log size set on the master: about 1 GB. (James Day recommends reducing this significantly.)
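For reference, shrinking it would look roughly like the following. This is a sketch, not a procedure we've run: the target size is illustrative rather than a specific recommendation, and InnoDB won't start if the size in my.cnf doesn't match the log files on disk, so the old ones have to be moved aside after a clean shutdown:

  # Sketch only: shrink the InnoDB log files (sizes illustrative).
  mysqladmin shutdown                       # clean shutdown flushes pending changes
  mv /var/lib/mysql/ib_logfile* /root/old-innodb-logs/
  # then in my.cnf, [mysqld] section:
  #   innodb_log_file_size = 128M           # down from ~1 GB
  mysqld_safe &                             # InnoDB recreates the logs at the new size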
While this step was running, mail was brought back up and the additional MySQL servers for text storage were brought online. Two of the slaves were found to be slightly behind, and the master log file appeared to be smaller than their recorded master log offsets. This might indicate corruption of the master log file, or it might simply indicate that the position was corrupted on the slaves. In either case, this is very much non-fatal for text storage, as it's very redundant and automatically falls back to the master on missing loads. (But it should be looked into. We may have write-back caching or other problems on those boxen.)
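The check itself is just the standard MySQL status commands, roughly as below (host names are placeholders, not our actual box names). If the recorded positions really are bogus, the slaves can be re-pointed with CHANGE MASTER TO or simply re-cloned from the master:

  # Compare what the slave thinks it has read against what the master
  # actually has on disk (host names are placeholders).
  mysql -h text-slave1 -e 'SHOW SLAVE STATUS\G'
  mysql -h text-master -e 'SHOW MASTER STATUS\G'
  # on the master: is the binlog really shorter than the slave's offset?
  # (file name/path depend on the binlog config)
  ls -l /var/lib/mysql/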
4) "Where's my site?"
Once the primary database was done, it was time to flip the switch and watch the site come back! Except that the Squid+LVS+Apache infrastructure is a little fragile, and in particular LVS was not set up to start automatically.
At this point it was late in Europe and our volunteer admins who do much of the Squid and LVS work were asleep. I was able to find the information I needed on our internal admin documentation wiki, and got these back online after a short while.
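For the record, bringing a basic LVS virtual service back by hand is roughly the following. This is a sketch rather than our actual config; the VIP, real-server addresses, and scheduler are placeholders:

  # Recreate the virtual service and put the real servers back behind it.
  ipvsadm -A -t 10.0.0.100:80 -s wlc                # define the virtual service
  ipvsadm -a -t 10.0.0.100:80 -r 10.0.0.11:80 -g    # add a real server (direct routing)
  ipvsadm -a -t 10.0.0.100:80 -r 10.0.0.12:80 -g
  ipvsadm -L -n                                     # verify the table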
Additionally I had to restart the IRC feeds for recent changes data, which involved recovering a tool that had been moved between home directories.
Things appear to be pretty much working at this point. In the short term, we need to examine the broken MySQL slave servers, and make sure we're at full capacity.
In the medium term, we need to make sure that all services either will start automatically or can be very easily and plainly started manually.
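The auto-start side could look something like the sketch below; the exact mechanism and service names depend on the distribution and on which init scripts are actually installed on each box, so treat these as placeholders to be checked:

  # Red Hat-style boxes:
  chkconfig squid on
  chkconfig mysqld on
  # Debian-style boxes:
  update-rc.d squid defaults
  # quick audit of services that won't come back on their own:
  chkconfig --list | grep -v ':on'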
We also *must* examine our DNS & LDAP infrastructure: if we can't make it boot fast and reliably, we need to consider replacing it with something more primitive but reliable. (ewww, scp'ing hosts files...)
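If it came to that, the fallback really could be as primitive as a loop over scp. A sketch; /etc/cluster-hosts and /etc/hosts.master are hypothetical names, not files we actually have:

  #!/bin/sh
  # Push a canonical hosts file to every box in the cluster, with no
  # dependency on DNS or LDAP being up. File names here are hypothetical.
  for h in $(cat /etc/cluster-hosts); do
      scp -q /etc/hosts.master "$h:/etc/hosts" || echo "FAILED: $h"
  done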
We also need to make sure that:
* Squid error messages are easily configurable and can be updated with necessary information (see the sketch below).
* DNS can be easily updated when the Florida cluster is offline, e.g. so that we could redirect hits from Florida to another cluster for an error page or read-only mirror.
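On the Squid side, the standard knob is pointing error_directory at our own copies of the error templates. A sketch only; the directory path is a placeholder, not something we've set up:

  # squid.conf: serve our own error pages instead of the stock ones.
  # The ERR_* files in this directory are plain HTML we can edit with
  # status/contact information.
  error_directory /etc/squid/errors-local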
-- brion vibber (brion @ pobox.com)