Recovery consisted of four phases:
1) "What the hell is going on?" Outage mentioned immediately by users in IRC. On investigation, the whole of PowerMedium appeared to be offline. Mark indicated that they had a major network problem. I phoned their support line; they confirmed a big network problem and said they were bringing in Charles to work on it. (This was about 3:30pm Sunday afternoon Florida time).
At this point there was nothing further we could do; we had to wait for them to fix things on their end. I also called Kyle so we'd have a pair of hands in the office when things started to come back online.
2) "Why does nothing work?"
After an hour or so they apparently had their general issues under control. PowerMedium's own web site came back up, we could get at our own switch over the network, and Charles (bw) was available online.
Between us remote folks and bw & Kyle on-site, we did some banging on rocks. We found that in addition to the network outage, there had been a power problem (presumably this is what killed their routers too), which had rebooted everything.
In this stage we were confronted with the fragility of the internal DNS and LDAP we had set up to make everything work. While we've expended some effort to minimize the dependencies on NFS, we hadn't yet put similar effort into these services. Until they were restored, booting was a *very* slow proposition (with lots of timeout steps), and it took another hour or so to get key infrastructure back in place to the point where we could seriously get to work.
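For next time it might be worth having a dumb pre-flight check that fails fast instead of letting every service sit through resolver and LDAP timeouts. Something along these lines would do; the hostnames and addresses below are only placeholders, not our actual setup:

    import socket

    # Placeholder addresses -- substitute the real internal DNS and LDAP hosts.
    CHECKS = [
        ("dns",  "10.0.0.1", 53),
        ("ldap", "10.0.0.2", 389),
    ]

    def reachable(host, port, timeout=2):
        # Plain TCP connect with a short timeout, so a dead service
        # fails in seconds rather than minutes.
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(timeout)
        try:
            s.connect((host, port))
            return True
        except socket.error:
            return False
        finally:
            s.close()

    for name, host, port in CHECKS:
        if reachable(host, port):
            print("%s %s:%d is up" % (name, host, port))
        else:
            print("%s %s:%d is DOWN" % (name, host, port))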
3) "Where's my data?"
MediaWiki is highly reliant on its database backend. With the machines back on, we were able to start the MySQL databases loading up and running InnoDB transaction recovery. This took much longer than expected, apparently because we have a *huge* log size set on the master: about 1 GB. (James Day recommends reducing this significantly.)
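To double-check what the master is actually running with, something like this would work. It's just a sketch: it shells out to the stock mysql client and assumes credentials come from ~/.my.cnf:

    import subprocess

    # Ask the running server rather than trusting what's in my.cnf on disk.
    out = subprocess.check_output(
        ["mysql", "-N", "-e", "SHOW VARIABLES LIKE 'innodb_log_file_size'"]
    ).decode()

    name, value = out.split()
    print("%s = %d MB" % (name, int(value) // (1024 * 1024)))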
While this step was running, mail was brought back online and the additional MySQL servers for text storage were brought online. Two of the slaves were found to be slightly behind, and the master log file appears to be smaller than their recorded master log offsets. This might indicate corruption of the master log file, or it might simply indicate that the position was corrupted on the slaves. In either case, this is very much non-fatal for text storage as it's very redundant and automatically falls back to the master on missing loads. (But it should be looked into. We may have write-back caching or other problems on those boxen.)
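When someone looks into it, comparing what the slaves think their position is against what the master actually has should show which side is off. Roughly like this; the host names are placeholders and it just drives the plain mysql client:

    import subprocess

    # Placeholder host names -- substitute the real master and text-storage slaves.
    MASTER = "db-master.example"
    SLAVES = ["db-slave1.example", "db-slave2.example"]

    def query(host, sql):
        # Stock mysql client; credentials come from ~/.my.cnf.
        return subprocess.check_output(["mysql", "-h", host, "-e", sql]).decode()

    print("=== master ===")
    print(query(MASTER, "SHOW MASTER STATUS\\G"))

    for host in SLAVES:
        print("=== %s ===" % host)
        out = query(host, "SHOW SLAVE STATUS\\G")
        # Only the binlog file/position fields matter for this comparison.
        for line in out.splitlines():
            if "Master_Log_File" in line or "Master_Log_Pos" in line:
                print("  " + line.strip())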
4) "Where's my site?"
Once the primary database was done, it was time to flip the switch and watch the site come back! Except that the Squid+LVS+Apache infrastructure is a little fragile, and in particular LVS was not set up to start automatically.
At this point it was late in Europe and our volunteer admins who do much of the squid and LVS work were asleep. I was able to find the information I needed on our internal admin documentation wiki, and got these back online after a short while.
Additionally I had to restart the IRC feeds for recent changes data, which involved recovering a tool which had gotten moved around between home directories.
Things appear to be pretty much working at this point. In the short term, we need to examine the broken MySQL slave servers, and make sure we're at full capacity.
In the medium term, we need to make sure that all services either will start automatically or can be very easily and plainly started manually.
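A first pass at that could just be a script run on each box that flags anything not registered to come up at boot. For example, assuming the RPM-style chkconfig setup we have; the service list is only illustrative:

    import subprocess

    # Services we expect at boot on a given box; names are illustrative only.
    EXPECTED = ["squid", "mysqld", "httpd", "named"]

    out = subprocess.check_output(["chkconfig", "--list"]).decode()

    enabled = set()
    for line in out.splitlines():
        fields = line.split()
        # Standard init services show up as: name 0:off 1:off ... 5:on 6:off
        if len(fields) > 1 and ("3:on" in fields or "5:on" in fields):
            enabled.add(fields[0])

    for svc in EXPECTED:
        if svc not in enabled:
            print("WARNING: %s is not set to start at boot" % svc)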
We also *must* examine our DNS & LDAP infrastructure: if we can't make it boot fast and reliably, we need to consider replacing it with something more primitive but reliable. (ewww, scp'ing hosts files...)
We also need to make sure that:
- Squid error messages are easily configurable and can be updated with necessary information.
- DNS can be easily updated when the Florida cluster is offline, eg so that we could redirect hits from Florida to another cluster for an error page or read-only mirror.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
We also need to make sure that:
- Squid error messages are easily configurable and can be updated with necessary
information.
This is just a matter of overwriting files in /usr/share/squid/errors/ on all squid servers. I think there's a script to do that somewhere, but of course it can be done manually when the Florida cluster is down.
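If the script can't be found, a replacement is only a few lines. Something like the following, assuming a local copy of the error pages and a list of the squid hosts (both are placeholders here):

    import subprocess

    # Placeholder host list -- substitute the real squid servers.
    SQUIDS = ["sq1.example.org", "sq2.example.org"]

    # Local directory with our customized error pages, and where squid reads them.
    SRC = "errors/"
    DEST = "/usr/share/squid/errors/"

    for host in SQUIDS:
        # rsync over ssh; trailing slashes make it copy directory contents.
        subprocess.check_call(["rsync", "-av", SRC, "%s:%s" % (host, DEST)])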
- DNS can be easily updated when the Florida cluster is offline, eg so that we
could redirect hits from Florida to another cluster for an error page or read-only mirror.
This is easy as well. The entire PowerDNS configuration directory structure is present on all external DNS servers. Usually updating is done by rsyncing from zwinger, but there's no reason why it can't be done from any of the other servers if zwinger is unreachable.
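For reference, the manual push is roughly the following. The config path is an assumption (check where PowerDNS actually keeps it on our boxes), and the host names are placeholders:

    import subprocess

    # Placeholder names -- substitute the real external DNS servers.
    DNS_SERVERS = ["ns1.example.org", "ns2.example.org"]

    # Assumed PowerDNS config location; double-check the actual path on the boxes.
    CONF = "/etc/powerdns/"

    # Can be run from any server holding a current copy, not just zwinger.
    for host in DNS_SERVERS:
        subprocess.check_call(["rsync", "-av", CONF, "%s:%s" % (host, CONF)])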
Mark Bergsma wrote:
Brion Vibber wrote:
We also need to make sure that:
- Squid error messages are easily configurable and can be updated with necessary
information.
This is just a matter of overwriting files in /usr/share/squid/errors/ on all squid servers. I think there's a script to do that somewhere, but of course it can be done manually when the Florida cluster is down.
When we last tried this, it didn't work. All of the servers continued to show the old error messages, and no one was able to say why.
Are you sure the path is consistent on all machines, not being overwritten by anything, etc?
- DNS can be easily updated when the Florida cluster is offline, eg so that we
could redirect hits from Florida to another cluster for an error page or read-only mirror.
This is easy as well. The entire PowerDNS configuration directory structure is present on all external DNS servers. Usually updating is done by rsyncing from zwinger, but there's no reason why it can't be done from any of the other servers if zwinger is unreachable.
Great!
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
- Squid error messages are easily configurable and can be updated with necessary
information.
This is just a matter of overwriting files in /usr/share/squid/errors/ on all squid servers. I think there's a script to do that somewhere, but of course it can be done manually when the Florida cluster is down.
When we last tried this, it didn't work. All of the servers continued to show the old error messages, and no one was able to say why.
Are you sure the path is consistent on all machines, not being overwritten by anything, etc?
It's in the Squid RPM anyway; rpm -ql lists these files. It *should* work, but I have never tried it.
Perhaps there are some old files in /usr/local/squid/something lying around that you were trying to overwrite? I think I made sure they were the same during the transition to the RPM, but I'm not entirely sure anymore, and things might have changed since.