Recovery consisted of four phases:
1) "What the hell is going on?" Outage mentioned immediately by users in IRC. On investigation, the whole of PowerMedium appeared to be offline. Mark indicated that they had a major network problem. I phoned their support line; they confirmed a big network problem and said they were bringing in Charles to work on it. (This was about 3:30pm Sunday afternoon Florida time).
At this point there was nothing further we could do; we had to wait for them to fix things on their end. I also called Kyle so we'd have a pair of hands in the office when things started to come back online.
2) "Why does nothing work?"
After an hour or so they apparently had their general issues under control. PowerMedium's own web site came back up, we could get at our own switch over the network, and Charles (bw) was available online.
Between us remote folks and bw & Kyle on-site, we did some banging on rocks. We found that in addition to the network outage there had been a power problem (presumably this is what killed their routers too), which had rebooted everything.
At this stage we were confronted with the fragility of the internal DNS and LDAP we had set up to make everything work. While we'd expended some effort to minimize the dependencies on NFS, we hadn't yet put similar effort into these services. Until they were restored, booting was a vveeerrryyyy slow proposition (with lots of timeout steps), and it took another hour or so to get key infrastructure back in place to where we could seriously get working.
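We haven't settled on a fix yet, but the kind of tightening I have in mind looks roughly like this. A sketch only, assuming the stock glibc resolver and nss_ldap; the timeout values are illustrative, not tested settings:

  # Make lookups fail over quickly instead of hanging through the long
  # default timeouts while the DNS/LDAP boxes are still down.
  echo 'options timeout:2 attempts:1' >> /etc/resolv.conf

  # nss_ldap: don't block boot/logins waiting on an unreachable LDAP server
  printf 'bind_policy soft\nbind_timelimit 5\n' >> /etc/ldap.conf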
3) "Where's my data?"
MediaWiki is highly reliant on its database backend. With machines up, we were able to start the MySQL databases, which began running InnoDB transaction recovery. This took much longer than expected, apparently because we have a *huge* log size set on the master: about 1 GB. (James Day recommends reducing this significantly.)
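For reference, shrinking it would look roughly like the following. This is a sketch, not a procedure we've run: the target size is illustrative rather than a specific recommendation, and InnoDB won't start if the size in my.cnf doesn't match the log files on disk, so the old ones have to be moved aside after a clean shutdown:

  # Sketch only: shrink the InnoDB log files (sizes illustrative).
  mysqladmin shutdown                       # clean shutdown flushes pending changes
  mv /var/lib/mysql/ib_logfile* /root/old-innodb-logs/
  # then in my.cnf, [mysqld] section:
  #   innodb_log_file_size = 128M           # down from ~1 GB
  mysqld_safe &                             # InnoDB recreates the logs at the new size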
While this step was running, mail was brought back up and the additional MySQL servers for text storage were brought online. Two of the slaves were found to be slightly behind, and the master log file appeared to be smaller than their recorded master log offsets. This might indicate corruption of the master log file, or it might simply indicate that the position was corrupted on the slaves. In either case, this is very much non-fatal for text storage, as it's very redundant and automatically falls back to the master on missing loads. (But it should be looked into. We may have write-back caching or other problems on those boxen.)
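The check itself is just the standard MySQL status commands, roughly as below (host names are placeholders, not our actual box names). If the recorded positions really are bogus, the slaves can be re-pointed with CHANGE MASTER TO or simply re-cloned from the master:

  # Compare what the slave thinks it has read against what the master
  # actually has on disk (host names are placeholders).
  mysql -h text-slave1 -e 'SHOW SLAVE STATUS\G'
  mysql -h text-master -e 'SHOW MASTER STATUS\G'
  # on the master: is the binlog really shorter than the slave's offset?
  # (file name/path depend on the binlog config)
  ls -l /var/lib/mysql/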
4) "Where's my site?"
Once the primary database was done, it was time to flip the switch and watch the site come back! Except that the Squid+LVS+Apache infrastructure is a little fragile, and in particular LVS was not set up to start automatically.
At this point it was late in Europe and our volunteer admins who do much of the Squid and LVS work were asleep. I was able to find the information I needed on our internal admin documentation wiki, and got these back online after a short while.
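For the record, bringing a basic LVS virtual service back by hand is roughly the following. This is a sketch rather than our actual config; the VIP, real-server addresses, and scheduler are placeholders:

  # Recreate the virtual service and put the real servers back behind it.
  ipvsadm -A -t 10.0.0.100:80 -s wlc                # define the virtual service
  ipvsadm -a -t 10.0.0.100:80 -r 10.0.0.11:80 -g    # add a real server (direct routing)
  ipvsadm -a -t 10.0.0.100:80 -r 10.0.0.12:80 -g
  ipvsadm -L -n                                     # verify the table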
Additionally I had to restart the IRC feeds for recent changes data, which involved recovering a tool that had been moved between home directories.
Things appear to be pretty much working at this point. In the short term, we need to examine the broken MySQL slave servers, and make sure we're at full capacity.
In the medium term, we need to make sure that all services either will start automatically or can be very easily and plainly started manually.
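The auto-start side could look something like the sketch below; the exact mechanism and service names depend on the distribution and on which init scripts are actually installed on each box, so treat these as placeholders to be checked:

  # Red Hat-style boxes:
  chkconfig squid on
  chkconfig mysqld on
  # Debian-style boxes:
  update-rc.d squid defaults
  # quick audit of services that won't come back on their own:
  chkconfig --list | grep -v ':on'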
We also *must* examine our DNS & LDAP infrastructure: if we can't make it boot fast and reliably, we need to consider replacing it with something more primitive but reliable. (ewww, scp'ing hosts files...)
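If it came to that, the fallback really could be as primitive as a loop over scp. A sketch; /etc/cluster-hosts and /etc/hosts.master are hypothetical names, not files we actually have:

  #!/bin/sh
  # Push a canonical hosts file to every box in the cluster, with no
  # dependency on DNS or LDAP being up. File names here are hypothetical.
  for h in $(cat /etc/cluster-hosts); do
      scp -q /etc/hosts.master "$h:/etc/hosts" || echo "FAILED: $h"
  done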
We also need to make sure that:
* Squid error messages are easily configurable and can be updated with necessary information (see the sketch below).
* DNS can be easily updated when the Florida cluster is offline, e.g. so that we could redirect hits from Florida to another cluster for an error page or read-only mirror.
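On the Squid side, the standard knob is pointing error_directory at our own copies of the error templates. A sketch only; the directory path is a placeholder, not something we've set up:

  # squid.conf: serve our own error pages instead of the stock ones.
  # The ERR_* files in this directory are plain HTML we can edit with
  # status/contact information.
  error_directory /etc/squid/errors-local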
-- brion vibber (brion @ pobox.com)