Database outage postmortem - Wikitech-l

11 Aug 2007

Samuel, the MySQL master server for the s3 cluster, encountered some sort
of error a bit after 14:50 UTC today, leading to about a half hour of
intermittently broken access.

The machine is still on the network, but refuses MySQL connections and
drops SSH connections. Before completely losing connectivity to samuel,
I saw many threads in 'opening tables' state in the process list. A disk
error is possible; further diagnosis awaits Rob's next trip into the
data center to fix up samuel and db3.

My first step on discovering there was something quite awry was to put
s3/s3a/default into read-only mode and remove samuel from the server
list, so read-only access could continue during further recovery efforts.
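
For anyone repeating this step, here is a rough sketch of the idea in
Python. It is not our actual configuration code: the Section class, its
field names, and the load weights are all made up for illustration. Only
the host names come from this incident.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Section:
    """Hypothetical stand-in for one database section's configuration."""
    master: Optional[str]
    loads: Dict[str, int] = field(default_factory=dict)  # host -> read weight
    read_only_reason: Optional[str] = None  # None means read/write


def enter_read_only(section, failed_host, reason):
    """Return a read-only copy of `section` that no longer lists `failed_host`."""
    loads = {h: w for h, w in section.loads.items() if h != failed_host}
    if not loads:
        raise ValueError("no servers left to serve reads")
    # If the failed host was the master, leave the master unset until a
    # replacement has been chosen.
    master = None if section.master == failed_host else section.master
    return Section(master=master, loads=loads, read_only_reason=reason)


# Illustrative values only; the weights are invented.
s3 = Section(master="samuel", loads={"samuel": 0, "db1": 100, "db5": 100})
s3_ro = enter_read_only(s3, "samuel", "master failure, recovery in progress")
```

The point of doing both changes together is that readers stop waiting on
the dead host while writes are refused cleanly instead of failing.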

After confirming that the remaining s3 slaves were consistent, I
switched masters to db1 and restored read/write mode.
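
One way to check slave consistency before promoting a new master, given
as a sketch rather than a record of exactly what I ran: compare the
executed master position each slave reports. The code assumes you have
already collected SHOW SLAVE STATUS output into one dict per host; the
function name and the sample values are hypothetical.

```python
def slaves_consistent(statuses):
    """True if every slave has executed up to the same master binlog position.

    `statuses` maps host name -> dict holding the Relay_Master_Log_File and
    Exec_Master_Log_Pos fields from that host's SHOW SLAVE STATUS output.
    """
    positions = {
        (s["Relay_Master_Log_File"], int(s["Exec_Master_Log_Pos"]))
        for s in statuses.values()
    }
    # Zero or one distinct position means nobody is ahead of anybody else.
    return len(positions) <= 1


# Made-up positions for illustration.
example = {
    "db1": {"Relay_Master_Log_File": "bin.000042", "Exec_Master_Log_Pos": "1234"},
    "db5": {"Relay_Master_Log_File": "bin.000042", "Exec_Master_Log_Pos": "1234"},
}
ok_to_switch = slaves_consistent(example)
```

If the positions differ, the most advanced slave is the natural
candidate for promotion, and the others need to catch up first.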

I encountered a couple of snags during this process. Many apache
processes seemed to be hanging, leading to 'resource unavailable' errors
reported by the squid proxies.

Unfortunately I wasn't able to fully diagnose this while in the middle
of switching masters, but I suspect connections timing out against
samuel and/or adler (which has been down for some time, but was still
listed in the s3a group as the next available server) and/or bogus
wait-for-slave delays.

Graceful apache restarts didn't seem to help much, but a forced restart
(killing old processes) seemed to do the job once I'd resolved the
databases themselves.

The s3 databases are currently humming along happily, though with adler
still out we are down to just one slave in the general s3/default pool
(db5). If we lose one more, we'll lose our redundancy and will have to
take the group into read-only to clone another slave when a server
becomes available.

So it would be nice if we can get another slave back online before
losing one more. :)

The s3a subgroup has one additional slave available (webster).

Software issues:

With the setproctitle extension either disabled or undocumented, it's
now harder to tell where stuck processes are stuck. We should have
an equivalent debugging tool available if possible.

There may be an issue with too-long timeouts either on MySQL connections
or slave waits. Should double-check on this.
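
To illustrate the kind of behavior I'd like to see, here is a small
Python sketch of failing over quickly when a listed server is dead. It
is not MediaWiki code; first_reachable, the one-second timeout, and the
injectable connect argument are assumptions made for the example, and a
bare TCP connect is only a stand-in for a real MySQL handshake.

```python
import socket


def first_reachable(hosts, port=3306, timeout=1.0,
                    connect=socket.create_connection):
    """Return the first host that accepts a TCP connection within `timeout`
    seconds, or None if none of them do."""
    for host in hosts:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Refused, unreachable, or timed out: skip this host quickly
            # instead of letting the request hang on it.
            continue
        conn.close()
        return host
    return None
```

With a short timeout, a down server that is still in the list costs
each request about a second at worst, rather than tying up an apache
process until a long default timeout expires.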

-- brion vibber (brion @ wikimedia.org)