[Labs-l] Unplanned outage in progress (resolved)

Andrew Bogott abogott at wikimedia.org
Tue Feb 24 08:39:56 UTC 2015


On 2/24/15 12:11 AM, Yuvi Panda wrote:
> Thanks to some heroics from Andrew Bogott, the server is back up now
> and everything should be functioning normally. Let me know if any
> tools need restarting.

The quick (and not yet entirely confirmed) explanation for this outage 
is "kernel bug on virt1012 caused all instances to lose network 
connectivity."

This was solved by restarting virt1012, but that of course meant rebooting 
all the instances on the box.  A complete list of affected instances is 
here:

https://phabricator.wikimedia.org/P326

Painfully, virt1012 was the box that I evacuated virt1005 to when /it/ 
died last week.  So essentially every instance that rebooted last week 
was rebooted again today, hence the near-identical tools outage.

If virt1012 had waited just a few more hours to die, then it would've 
died during the scheduled outage tomorrow and no one would've noticed :(

-A