[Labs-l] Partial (but dramatic) labs outage on Tuesday: 2015-02-24 1500UTC-1800UTC

Andrew Bogott abogott at wikimedia.org
Sun Feb 22 02:00:46 UTC 2015


On 2/21/15 5:34 PM, Ricordisamoa wrote:
> Thanks! I know you're doing your best to deal with outages and 
> performance issues.
> Out of curiosity, do you foresee the Foundation allocating some more 
> dedicated people/hardware for Labs?
We have only just added a third full-time engineer, Yuvi.  My preference 
going forward is to distribute Labs knowledge more widely through the 
Ops team so that there are /many/ more people available to help in a pinch.  
We've been documenting and scripting as much as we can to facilitate 
that... if everything is still falling to just the three of us a few 
months from now, then we can start lobbying for a fourth dedicated engineer.

Labs isn't especially constrained by hardware limitations; it's much 
more a question of human bandwidth to adequately manage what hardware we 
have.  The Foundation has been quick to fund Labs hardware requests when 
we make them -- the pain is generally in transition and management 
rather than actual limited financial resources.  Case in point: a shiny 
new pile of hard drives is the /cause/ of the outage in the subject line :)

-Andrew

>
> Il 22/02/2015 01:57, Andrew Bogott ha scritto:
>> On 2/20/15 8:07 AM, Ricordisamoa wrote:
>>> Thank you.
>>> I (and probably many others) would like someone from the Ops team to 
>>> elaborate on the uptime and general reliability Labs (especially 
>>> Tools) is supposed to have, and what kinds of services it is 
>>> suitable for, to prevent future misunderstandings with regard to loss 
>>> of important work, etc.
>> Hello!
>>
>> I don't want to ignore your question, but I also don't exactly know 
>> how to answer it.  We're very unlikely to be able to project any kind 
>> of future uptime percentage, because currently Labs runs on few 
>> enough servers that any attempt to predict uptime by multiplying 
>> failure rates by server counts would produce such giant error bars as 
>> to be useless.
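
(To make the error-bar point concrete, here is a hypothetical back-of-envelope sketch. The fleet size and per-server annual failure rates below are invented for illustration, not real Labs figures: with only a couple dozen servers, a small nudge in the assumed per-server rate swings the predicted fleet-level failure probability enormously.)

```python
# Hypothetical back-of-envelope; none of these numbers describe real Labs hardware.
# P(at least one of n servers fails in a year) = 1 - (1 - p)^n,
# assuming independent failures, which is itself a shaky assumption.

def fleet_failure_prob(p_single: float, n_servers: int) -> float:
    """Probability that at least one of n servers fails, given per-server rate p."""
    return 1 - (1 - p_single) ** n

n = 20  # an invented, roughly Labs-scale fleet
for p in (0.02, 0.05, 0.10):  # plausible-sounding per-server annual rates
    print(f"p={p:.2f}: P(any failure)={fleet_failure_prob(p, n):.2f}")
# -> 0.33, 0.64, 0.88: a 5x spread in the input assumption more than
#    doubles the output, i.e. the error bars swallow the estimate.
```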
>>
>> Nonetheless, I can recap our uptime and storage vulnerabilities so 
>> that you know what to be wary of.
>>
>> Bad news:
>>
>> - Each Labs instance is stored on a single server.  If any one server 
>> is destroyed in a catastrophe (e.g. a hard-drive crash or a blow from 
>> a pickaxe), the state of all VMs it hosts will be suspended or, in 
>> extreme cases, lost. [1]
>>
>> - There are three full-time Operations staff members dedicated to 
>> supporting Labs.  We don't cover all timezones perfectly, and 
>> sometimes we take weekends and vacations. [2]
>>
>> - Although the Tools grid engine is distributed among many instances 
>> (and, consequently, many physical servers), actual tools usage relies 
>> on several single points of failure, the most obvious of which is the 
>> web proxy. [3]
>>
>> - All of Labs currently lives in a single datacenter. It's a very 
>> dependable datacenter, but nonetheless vulnerable to cable cuts, 
>> fires, and other local disaster scenarios. [4]
>>
>> Good news:
>>
>> - Problems like the GHOST vulnerability, which mandated a reboot of 
>> all hardware in late January, are very rare.
>>
>> - The cause of the outage on Tuesday was quite bad (and quite 
>> unusual), and we were nevertheless able to recover from it without 
>> data loss. 
>> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150217-LabsOutage
>>
>> - Yuvi has churned out a ton of great monitoring tools, which mean 
>> that we're ever more aware of and responsive to incidents that might 
>> precede outages.
>>
>> - Use of Labs and Tools is growing like crazy!  This means that the 
>> Labs team is stretched a bit thin rushing to keep up, but I have a 
>> hard time thinking of this as bad news.
>>
>> I'm aware that this response is entirely qualitative, and that you 
>> might prefer some actual quantities and statistics.  I'm not 
>> reluctant to provide those, but I simply don't know where to begin. 
>> If you have any specific questions that would help address your 
>> particular concerns, please don't hesitate to ask.
>>
>> -Andrew
>>
>>
>>
>> [1] This is consistent with a 'cattle, not pets' design pattern. For 
>> example, all Tools instances are fully puppetized, and any lost 
>> instance can be replaced with a few minutes' work.  Labs users 
>> outside of the Tools project should hew closely to this design model 
>> as well.  This vulnerability could be partially mitigated with 
>> something like https://phabricator.wikimedia.org/T90364 but that has 
>> potential downsides.
>>
>> Note that data stored on shared NFS servers and in databases is 
>> highly redundant and much less subject to destruction.
>>
>> [2] Potential mitigation for this is obvious, but extremely expensive :(
>>
>> [3] Theoretical mitigation for this is 
>> https://phabricator.wikimedia.org/T89995, for which I would welcome a 
>> Hackathon collaborator.
>>
>> [4] I believe that there are plans in place for backup replication of 
>> NFS and Database data to a second data center; I will let Coren and 
>> Sean comment on the specifics.
>>
>> _______________________________________________
>> Labs-l mailing list
>> Labs-l at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
>



