[Labs-l] Unplanned outage in progress (resolved)
Andrew Bogott
abogott at wikimedia.org
Tue Feb 24 08:39:56 UTC 2015
On 2/24/15 12:11 AM, Yuvi Panda wrote:
> Thanks to some heroics from Andrew Bogott, the server is back up now
> and everything should be functioning normally. Let me know if any
> tools need restarting.
The quick (and not yet entirely confirmed) explanation for this outage
is "kernel bug on virt1012 caused all instances to lose network
connectivity."
This was solved by restarting virt1012, but that of course necessitated rebooting
all the instances on the box. A complete list of affected instances is
here:
https://phabricator.wikimedia.org/P326
Painfully, virt1012 was the box that I evacuated virt1005 to when /it/
died last week. So essentially every instance that rebooted last week
was rebooted again today, hence the near-identical tools outage.
If virt1012 had waited just a few more hours to die, then it would've
died during the scheduled outage tomorrow and no one would've noticed :(
-A