[Labs-l] Yet another partial labs outage (Resolved)

Yuvi Panda yuvipanda at gmail.com
Tue May 19 22:29:22 UTC 2015


And again: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150519-LabsOutage



On Mon, May 18, 2015 at 1:16 PM, Andrew Bogott <abogott at wikimedia.org> wrote:
> A similar failure just happened on a different compute node.  We're
> researching to see if these two failures were related.
>
> In the meantime -- all hosts are restarting and everything should be up
> within a couple of minutes -- total downtime no more than 10 minutes.  A
> full list of affected instances is at the bottom of this page:
>
> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150518-LabsOutage
>
> -A
>
>
>
> On 5/16/15 7:30 AM, Andrew Bogott wrote:
>>
>> This turns out to not have been a heating issue, or at least not entirely
>> -- it was some kind of kernel lockup.  Coren and others rebooted the system
>> and restarted all instances, and things seem to be working fine now.  We
>> don't have much explanation for what caused the problem, though, so we'll be
>> on the lookout.
>>
>> -A
>>
>>
>> On 5/15/15 11:31 PM, Andrew Bogott wrote:
>>>
>>> The hardware curse continues!
>>>
>>> One of the labs virt hosts (labvirt1003) is running very hot tonight,
>>> presumably due to a broken fan.  It is intermittently scaling the CPU speed
>>> way back to avoid melting; when that happens there are bound to be lots of
>>> side-effects like unresponsive instances, clock drift, and the like (not
>>> least of which is that right now I can't ssh into the damn thing, or get
>>> performance metrics.)
>>>
>>> Naturally this started happening late on a Friday, so it may be a while
>>> before I can get someone in the datacenter.  I'm leaving the host up in the
>>> meantime, based on the notion that half a server is better than none, but
>>> poor performance is likely to be the norm until then.
>>>
>>> I did shut off one instance:  wikidata-wdq-mm.  I don't have a personal
>>> grudge, but it was gobbling CPU cycles and the system really needs a rest.
>>> If loss of that instance is a disaster for anyone, contact me and I'll see
>>> if I can revive it and shut off ten or so other instances to make room.
>>>
>>> Updates as events warrant!
>>>
>>> -Andrew
>>
>>
>
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l



-- 
Yuvi Panda T
http://yuvi.in/blog


