[QA] Fwd: Re: [Ops] 2015-02-24 Labs outage post-mortem
Greg Grossmeier
greg at wikimedia.org
Fri Feb 27 05:19:27 UTC 2015
FYI, this caused a Beta Cluster outage tonight.
----- Forwarded message from Andrew Bogott <abogott at wikimedia.org> -----
> Date: Thu, 26 Feb 2015 19:12:23 -0800
> From: Andrew Bogott <abogott at wikimedia.org>
> To: Operations Engineers <ops at lists.wikimedia.org>
> Subject: Re: [Ops] 2015-02-24 Labs outage post-mortem
>
> This happened again, just now. I don't have any theory for what's happening
> -- it looks like a software issue, except that it has now happened twice on
> virt1012, and virt1012 should be identical to virt1011 and virt1010.
>
> Giuseppe has already had a go at this issue -- I'd appreciate any log-diving
> that anyone else is able to do. Additionally, I'd feel a lot better if we
> could order at least one more server, tomorrow[1]. As it is, even if we
> decide that virt1012 is cursed there's nowhere else for us to go.
>
> [1] Related tickets:
> https://phabricator.wikimedia.org/T90783
> https://phabricator.wikimedia.org/T89752
> https://phabricator.wikimedia.org/T90962
>
>
> On 2/24/15 11:10 AM, Andrew Bogott wrote:
> >We suffered yet another virt outage last night -- this time instance
> >networking failed on virt1012. Awkwardly, virt1012 is where I moved
> >everything from virt1005 during the outage last week, so all the same
> >instances were affected this week as last.
> >
> >The outage report is here:
> >
> >https://wikitech.wikimedia.org/wiki/Incident_documentation/20150224-LabsOutage
> >
> >
> >We didn't learn much from this one -- I welcome your thoughts and
> >additions.
> >
> >-Andrew
> >
----- End forwarded message -----
--
| Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |