[QA] Fwd: Re: [Ops] 2015-02-24 Labs outage post-mortem

Greg Grossmeier greg at wikimedia.org
Fri Feb 27 05:19:27 UTC 2015


FYI, the virt1012 failure described below caused a Beta Cluster outage tonight.

----- Forwarded message from Andrew Bogott <abogott at wikimedia.org> -----

> Date: Thu, 26 Feb 2015 19:12:23 -0800
> From: Andrew Bogott <abogott at wikimedia.org>
> To: Operations Engineers <ops at lists.wikimedia.org>
> Subject: Re: [Ops] 2015-02-24 Labs outage post-mortem
> 
> This happened again, just now.  I don't have any theory for what's happening
> -- it looks like a software issue, except that it has now happened twice on
> virt1012, and virt1012 should be identical to virt1011 and virt1010.
> 
> Giuseppe has already had a go at this issue -- I'd appreciate any log-diving
> that anyone else is able to do.  Additionally, I'd feel a lot better if we
> could order at least one more server, tomorrow[1].  As it is, even if we
> decide that virt1012 is cursed there's nowhere else for us to go.
> 
> [1] Related tickets:
> https://phabricator.wikimedia.org/T90783
> https://phabricator.wikimedia.org/T89752
> https://phabricator.wikimedia.org/T90962
> 
> 
> On 2/24/15 11:10 AM, Andrew Bogott wrote:
> >We suffered yet another virt outage last night -- this time instance
> >networking failed on virt1012.  Awkwardly, virt1012 is where I moved
> >everything from virt1005 during the outage last week, so all the same
> >instances were affected this week as last.
> >
> >The outage report is here:
> >
> >https://wikitech.wikimedia.org/wiki/Incident_documentation/20150224-LabsOutage
> >
> >
> >We didn't learn much from this one -- I welcome your thoughts and
> >additions.
> >
> >-Andrew
> >
> 
> 
> _______________________________________________
> Ops mailing list
> Ops at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/ops

----- End forwarded message -----

-- 
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg                A18D 1138 8E47 FAC8 1C7D |


