[Labs-l] Yet another partial labs outage (Resolved)

Wed May 20 07:22:41 UTC 2015

This one has no instance list!!

On Wed, May 20, 2015 at 12:29 AM, Yuvi Panda <yuvipanda at gmail.com> wrote:
> And again: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150519-LabsOutage
>
>
>
> On Mon, May 18, 2015 at 1:16 PM, Andrew Bogott <abogott at wikimedia.org> wrote:
>> A similar failure just happened on a different compute node.  We're
>> researching to see if these two failures were related.
>>
>> In the meantime -- all hosts are restarting and everything should be up
>> within a couple of minutes -- total downtime no more than 10 minutes.  A
>> full list of affected instances is at the bottom of this page:
>>
>> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150518-LabsOutage
>>
>> -A
>>
>>
>>
>> On 5/16/15 7:30 AM, Andrew Bogott wrote:
>>>
>>> This turns out to not have been a heating issue, or at least not entirely
>>> -- it was some kind of kernel lockup.  Coren and others rebooted the system
>>> and restarted all instances, and things seem to be working fine now.  We
>>> don't have much explanation for what caused the problem, though, so we'll be
>>> on the lookout.
>>>
>>> -A
>>>
>>>
>>> On 5/15/15 11:31 PM, Andrew Bogott wrote:
>>>>
>>>> The hardware curse continues!
>>>>
>>>> One of the labs virt hosts (labvirt1003) is running very hot tonight,
>>>> presumably due to a broken fan.  It is intermittently scaling the CPU speed
>>>> way back to avoid melting; when that happens there are bound to be lots of
>>>> side-effects like unresponsive instances, clock drift, and the like (not
>>>> least of which is that right now I can't ssh into the damn thing, or get
>>>> performance metrics.)
>>>>
>>>> Naturally this started happening late on a Friday, so it may be a while
>>>> before I can get someone in the datacenter.  I'm leaving the host up in the
>>>> meantime, based on the notion that half a server is better than none, but
>>>> poor performance is likely to be the norm until the host is repaired.
>>>>
>>>> I did shut off one instance:  wikidata-wdq-mm.  I don't have a personal
>>>> grudge, but it was gobbling CPU cycles and the system really needs a rest.
>>>> If loss of that instance is a disaster for anyone, contact me and I'll see
>>>> if I can revive it and shut off ten or so other instances to make room.
>>>>
>>>> Updates as events warrant!
>>>>
>>>> -Andrew
>>>
>>>
>>
>>
>> _______________________________________________
>> Labs-l mailing list
>> Labs-l at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
>
>
> --
> Yuvi Panda T
> http://yuvi.in/blog