[Labs-announce] Mild but long-running Tools outage in progress, resolved

Andrew Bogott abogott at wikimedia.org
Fri Jun 30 02:43:16 UTC 2017

The kernel roll-back was a success, and things are now behaving reasonably.

At some point we'll get a proper incident report together.  The short 
version of the story is: NFS performance was shockingly bad on the new 
kernel, as illustrated by the attached ridiculous graph.

Once we have a modern, less-broken kernel we'll need to try this all 
over again, but that won't happen right away and the update window will 
be pre-announced.

Thanks for bearing with us through all this!  Most services seem to have 
survived this last round of chaos but you might want to check your sites 
and restart services as needed.


On 6/29/17 8:25 PM, Andrew Bogott wrote:
> After various failed measures, we're now trying to revert to the older 
> kernel and switch back between NFS servers yet again.  So Tools NFS 
> (and various associated services) will probably break, at least for a 
> few minutes.
> With luck this will get us into a stable place, but I'll update again 
> regardless.
> -Andrew
> On 6/29/17 3:27 PM, Andrew Bogott wrote:
>>     The tools cluster is suffering from several maladies right now. 
>> Existing services seem to be mostly fine, but any Kubernetes services 
>> that tried to restart in the last few hours probably failed to start, 
>> and new things are still failing to start.  Similarly, web services 
>> and other tools are failing to restart in several cases.
>>     There are various theories as to what's going on -- most likely 
>> it's a kernel-version incompatibility with the newly upgraded NFS 
>> server.  There was an earlier LDAP outage which is better understood 
>> and should be resolved by now.
>>     We apologize for the inconvenience, and are working frantically 
>> to restore stability.  There will be a follow-up email when things 
>> are resolved.
>> -Andrew

-------------- next part --------------
A non-text attachment was scrubbed...
Name: nfsperformance.png
Type: image/png
Size: 104115 bytes
Desc: not available
URL: <https://lists.wikimedia.org/pipermail/labs-announce/attachments/20170629/2820266a/attachment-0001.png>
