[Labs-l] Yet another partial labs outage (Resolved)

Andrew Bogott abogott at wikimedia.org
Mon May 18 17:16:18 UTC 2015


A similar failure just happened on a different compute node.  We're 
investigating whether the two failures are related.

In the meantime -- all hosts are restarting and everything should be up 
within a couple of minutes -- total downtime no more than 10 minutes.  A 
full list of affected instances is at the bottom of this page:

https://wikitech.wikimedia.org/wiki/Incident_documentation/20150518-LabsOutage

-A


On 5/16/15 7:30 AM, Andrew Bogott wrote:
> This turns out to not have been a heating issue, or at least not 
> entirely -- it was some kind of kernel lockup.  Coren and others 
> rebooted the system and restarted all instances, and things seem to be 
> working fine now.  We don't have much explanation for what caused the 
> problem, though, so we'll be on the lookout.
>
> -A
>
>
> On 5/15/15 11:31 PM, Andrew Bogott wrote:
>> The hardware curse continues!
>>
>> One of the labs virt hosts (labvirt1003) is running very hot tonight, 
>> presumably due to a broken fan.  It is intermittently scaling the CPU 
>> speed way back to avoid melting; when that happens there are bound to 
>> be lots of side-effects like unresponsive instances, clock drift, and 
>> the like (not least of which is that right now I can't ssh into the 
>> damn thing, or get performance metrics.)
>>
>> Naturally this started happening late on a Friday, so it may be a 
>> while before I can get someone in the datacenter.  I'm leaving the 
>> host up in the meantime, based on the notion that half a server is 
>> better than none, but poor performance is likely to be the norm in 
>> the meantime.
>>
>> I did shut off one instance:  wikidata-wdq-mm.  I don't have a 
>> personal grudge, but it was gobbling CPU cycles and the system really 
>> needs a rest.  If loss of that instance is a disaster for anyone, 
>> contact me and I'll see if I can revive it and shut off ten or so 
>> other instances to make room.
>>
>> Updates as events warrant!
>>
>> -Andrew
>



