[Labs-l] Yet another partial labs outage (Resolved)
Andrew Bogott
abogott at wikimedia.org
Mon May 18 17:16:18 UTC 2015
A similar failure just happened on a different compute node. We're
researching to see if these two failures were related.
In the meantime -- all hosts are restarting and everything should be up
within a couple of minutes -- total downtime no more than 10 minutes. A
full list of affected instances is at the bottom of this page:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150518-LabsOutage
-A
On 5/16/15 7:30 AM, Andrew Bogott wrote:
> This turns out to not have been a heating issue, or at least not
> entirely -- it was some kind of kernel lockup. Coren and others
> rebooted the system and restarted all instances, and things seem to be
> working fine now. We don't have much explanation for what caused the
> problem, though, so we'll be on the lookout.
>
> -A
>
>
> On 5/15/15 11:31 PM, Andrew Bogott wrote:
>> The hardware curse continues!
>>
>> One of the labs virt hosts (labvirt1003) is running very hot tonight,
>> presumably due to a broken fan. It is intermittently scaling the CPU
>> speed way back to avoid melting; when that happens there are bound to
>> be lots of side-effects like unresponsive instances, clock drift, and
>> the like (not least of which is that right now I can't ssh into the
>> damn thing, or get performance metrics.)
>>
>> Naturally this started happening late on a Friday, so it may be a
>> while before I can get someone in the datacenter. I'm leaving the
>> host up in the meantime, based on the notion that half a server is
>> better than none, but poor performance is likely to be the norm until
>> the host is repaired.
>>
>> I did shut off one instance: wikidata-wdq-mm. I don't have a
>> personal grudge, but it was gobbling CPU cycles and the system really
>> needs a rest. If loss of that instance is a disaster for anyone,
>> contact me and I'll see if I can revive it and shut off ten or so
>> other instances to make room.
>>
>> Updates as events warrant!
>>
>> -Andrew
>