[Labs-l] Yet another partial labs outage (Resolved)

Petr Bena benapetr at gmail.com
Sat May 16 12:59:30 UTC 2015


OK, can you give us a list of the instances that were rebooted, so that I
don't have to check my instances one by one to see whether they rebooted?
Thanks

On Sat, May 16, 2015 at 2:30 PM, Andrew Bogott <abogott at wikimedia.org> wrote:
> This turns out to not have been a heating issue, or at least not entirely --
> it was some kind of kernel lockup.  Coren and others rebooted the system and
> restarted all instances, and things seem to be working fine now.  We don't
> have much explanation for what caused the problem, though, so we'll be on
> the lookout.
>
> -A
>
>
> On 5/15/15 11:31 PM, Andrew Bogott wrote:
>>
>> The hardware curse continues!
>>
>> One of the labs virt hosts (labvirt1003) is running very hot tonight,
>> presumably due to a broken fan.  It is intermittently scaling the CPU speed
>> way back to avoid melting; when that happens there are bound to be lots of
>> side-effects like unresponsive instances, clock drift, and the like (not
>> least of which is that right now I can't ssh into the damn thing, or get
>> performance metrics.)
>>
>> Naturally this started happening late on a Friday, so it may be a while
>> before I can get someone in the datacenter.  I'm leaving the host up in the
>> meantime, based on the notion that half a server is better than none, but
>> poor performance is likely to be the norm in the meantime.
>>
>> I did shut off one instance:  wikidata-wdq-mm.  I don't have a personal
>> grudge, but it was gobbling CPU cycles and the system really needs a rest.
>> If loss of that instance is a disaster for anyone, contact me and I'll see
>> if I can revive it and shut off ten or so other instances to make room.
>>
>> Updates as events warrant!
>>
>> -Andrew
