Samuel, the MySQL master server for the s3 cluster, encountered some sort of error shortly after 14:50 UTC today, leading to about half an hour of intermittently broken access.
The machine is still on the network, but it refuses MySQL connections and drops SSH connections. Before completely losing connectivity to samuel, I saw many threads in the 'opening tables' state in the process list. A disk error is possible; further diagnosis awaits Rob's next trip into the data center to fix up samuel and db3.
On discovering that something was quite awry, my first step was to put s3/s3a/default into read-only mode and remove samuel from the server list, so read-only access could continue during further recovery efforts.
After confirming that the remaining s3 slaves were consistent, I switched masters to db1 and restored read/write mode.
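For anyone repeating this kind of switch, the consistency check can be sketched roughly as below. This is only an illustration in Python, not the procedure I actually ran: the function name and the sample binlog values are made up, and it assumes each slave's SHOW SLAVE STATUS output has already been collected into a dict carrying the standard Relay_Master_Log_File and Exec_Master_Log_Pos fields.

```python
def slaves_consistent(statuses):
    """Return True if every slave has executed up to the same master position.

    `statuses` maps a slave name to a dict holding two fields from
    SHOW SLAVE STATUS: Relay_Master_Log_File and Exec_Master_Log_Pos.
    """
    positions = {
        (s["Relay_Master_Log_File"], int(s["Exec_Master_Log_Pos"]))
        for s in statuses.values()
    }
    # Exactly one distinct position means all slaves agree; an empty
    # mapping is deliberately treated as "not consistent".
    return len(positions) == 1


# Hypothetical sample values, for illustration only.
sample = {
    "db1": {"Relay_Master_Log_File": "binlog.000042", "Exec_Master_Log_Pos": 1234},
    "db5": {"Relay_Master_Log_File": "binlog.000042", "Exec_Master_Log_Pos": 1234},
}
print(slaves_consistent(sample))
```

If the positions differ, the slave that is furthest ahead would normally be the safer candidate for the new master, but that decision is outside what this sketch covers.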
I encountered a couple of snags during this process. Many apache processes seemed to be hanging, leading to 'resource unavailable' errors reported by the squid proxies.
Unfortunately, I wasn't able to fully diagnose this while in the middle of switching masters, but I suspect connections timing out to samuel and/or adler (which has been down for some time, but was still listed in the s3a group as the next available server) and/or bogus wait-for-slave delays.
Graceful apache restarts didn't seem to help much, but a forced restart (killing old processes) seemed to do the job once I'd resolved the databases themselves.
The s3 databases are currently humming along happily, though with adler still out we are down to just one slave in the general s3/default pool (db5). If we lose one more, we'll lose our redundancy and would have to take the group into read-only mode to clone another slave when a server becomes available.
So it would be nice if we could get another slave back online before losing one more. :)
The s3a subgroup has one additional slave available (webster).
Software issues:
With the setproctitle extension either disabled or undocumented, it's now harder to tell where the hung processes are stuck. We should have an equivalent debugging tool available if possible.
There may be an issue with too-long timeouts either on MySQL connections or slave waits. We should double-check this.
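To illustrate the timeout concern, here is a rough Python sketch of picking a server with a short, explicit connection timeout, so that a dead host still in the list costs at most a couple of seconds per request. It is not our actual load-balancing code: the function name, the 2-second default and the injectable `connect` parameter are assumptions for the example, and it only tests TCP reachability rather than a real MySQL handshake.

```python
import socket


def first_reachable(servers, timeout=2.0, connect=socket.create_connection):
    """Return the first (host, port) that accepts a TCP connection in time.

    Servers that time out or refuse the connection are skipped, so one
    dead host cannot hold a request for longer than `timeout` seconds.
    """
    for host, port in servers:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # OSError covers refused connections, DNS failures and
            # socket timeouts alike; move on to the next server.
            continue
        conn.close()
        return (host, port)
    # Nothing answered within the timeout.
    return None
```

Called as `first_reachable([("adler", 3306), ("webster", 3306)])`, this would fall through to webster once the attempt on adler fails or times out (3306 being the default MySQL port).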
-- brion vibber (brion @ wikimedia.org)