[Labs-l] Partial (but dramatic) labs outage on Tuesday: 2015-02-24 1500UTC-1800UTC
Ricordisamoa
ricordisamoa at openmailbox.org
Sun Feb 22 01:34:16 UTC 2015
Thanks! I know you're doing your best to deal with outages and
performance issues.
Out of curiosity, do you foresee the Foundation allocating some more
dedicated people/hardware for Labs?
On 22/02/2015 01:57, Andrew Bogott wrote:
> On 2/20/15 8:07 AM, Ricordisamoa wrote:
>> Thank you.
>> I (and probably many others) would like someone from the Ops team to
>> elaborate on the uptime and general reliability Labs (especially
>> Tools) is supposed to have, and on what kinds of services it is
>> suitable for, to prevent future misunderstandings regarding loss of
>> important work, etc.
> Hello!
>
> I don't want to ignore your question, but I also don't exactly know
> how to answer it. We're very unlikely to be able to project any kind
> of future uptime percentage, because labs currently runs on so few
> servers that any attempt to predict uptime by multiplying per-server
> failure rates by server counts would produce error bars too large to
> be useful.
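> 
> As a purely illustrative sketch of that point (with made-up numbers,
> not our real failure data): given only a handful of server-months of
> observations, the confidence interval on a per-server failure rate is
> enormous, and so is any fleet-wide uptime projection built on top of
> it.
> 
>     import math
> 
>     def wilson_interval(failures, n, z=1.96):
>         # 95% Wilson score interval for a per-server failure
>         # probability estimated from n server-months of data.
>         p = failures / float(n)
>         denom = 1 + z**2 / n
>         center = (p + z**2 / (2 * n)) / denom
>         half = (z / denom) * math.sqrt(
>             p * (1 - p) / n + z**2 / (4 * n**2))
>         return max(0.0, center - half), min(1.0, center + half)
> 
>     # Hypothetical example: 20 servers watched for one month, with
>     # one host-level failure observed in that window.
>     low, high = wilson_interval(failures=1, n=20)
>     print("monthly failure rate between {:.1%} and {:.1%}".format(
>         low, high))
>     # -> roughly 0.9% .. 23.6%; the two ends differ by more than an
>     #    order of magnitude, and so would any uptime number derived
>     #    from them.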
>
> Nonetheless, I can recap our uptime and storage vulnerabilities so
> that you know what to be wary of.
>
> Bad news:
>
> - Each labs instance is stored on a single server. If any one server
> is destroyed in a catastrophe (e.g. hard-drive crash, blow from a
> pickaxe, etc.) then all of the VMs it hosts will be suspended or, in
> extreme cases, their state will be lost. [1]
>
> - There are three full-time Operations staff members dedicated to
> supporting labs. We don't cover all timezones perfectly, and
> sometimes we take weekends and vacations. [2]
>
> - Although the Tools grid engine is distributed among many instances
> (and, consequently, many physical servers), actual tools usage relies
> on several single points of failure, the most obvious of which is the
> web proxy. [3]
>
> - All of labs currently lives in a single datacenter. It's a very
> dependable datacenter, but nonetheless vulnerable to cable cuts,
> fires, and other local disaster scenarios. [4]
>
> Good news:
>
> - Problems like the Ghost vulnerability, which mandated a reboot of
> all hardware in late January, are very rare.
>
> - The cause of the outage on Tuesday was quite bad (and quite
> unusual), but we were nevertheless able to recover from it without
> data loss.
> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150217-LabsOutage
>
> - Yuvi has churned out a ton of great monitoring tools which mean that
> we're ever more aware of and responsive to incidents that might
> precede outages.
>
> - Use of Labs and Tools is growing like crazy! This means that the
> Labs team is stretched a bit thin rushing to keep up, but I have a
> hard time thinking of this as bad news.
>
> I'm aware that this response is entirely qualitative, and that you
> might prefer some actual quantities and statistics. I'm not reluctant
> to provide those, but I simply don't know where to begin. If you have
> any specific questions that would help address your particular
> concerns, please don't hesitate to ask.
>
> -Andrew
>
>
>
> [1] This is consistent with a 'cattle, not pets' design pattern. For
> example, all tools instances are fully puppetized and any lost
> instance can be replaced with a few minutes' work. Labs users outside
> of the Tools project should hew closely to this design model as well.
> This vulnerability could be partially mitigated with something like
> https://phabricator.wikimedia.org/T90364 but that has potential
> downsides.
>
> Note that data stored on shared NFS servers and in Databases is highly
> redundant and much less subject to destruction.
>
> [2] Potential mitigation for this is obvious, but extremely expensive :(
>
> [3] Theoretical mitigation for this is
> https://phabricator.wikimedia.org/T89995, for which I would welcome a
> Hackathon collaborator.
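> 
> For the curious, the general shape of that mitigation is simply
> running more than one proxy and failing over when the active one
> stops answering. Here is a toy sketch of the health-check half of
> that idea, with hypothetical hostnames and no claim about what T89995
> will actually end up specifying:
> 
>     import socket
> 
>     # Illustrative placeholders only; the real proxy layout is not
>     # described here.
>     ACTIVE = "proxy-active.example.invalid"
>     STANDBY = "proxy-standby.example.invalid"
> 
>     def is_up(host, port=80, timeout=2.0):
>         # True if the host accepts a TCP connection on the port.
>         try:
>             sock = socket.create_connection((host, port), timeout)
>             sock.close()
>             return True
>         except (socket.error, socket.timeout):
>             return False
> 
>     if not is_up(ACTIVE) and is_up(STANDBY):
>         # In a real setup this step would move a floating IP or
>         # update DNS; here it is only a log line.
>         print("failing over from %s to %s" % (ACTIVE, STANDBY))
> 
> The hard part, of course, is the actual repointing step (floating IP,
> DNS, or otherwise), which the sketch deliberately leaves out.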
>
> [4] I believe that there are plans in place for backup replication of
> NFS and Database data to a second data center; I will let Coren and
> Sean comment on the specifics.
>