[Labs-l] Partial (but dramatic) labs outage on Tuesday: 2015-02-24 1500UTC-1800UTC
Andrew Bogott
abogott at wikimedia.org
Sun Feb 22 02:00:46 UTC 2015
On 2/21/15 5:34 PM, Ricordisamoa wrote:
> Thanks! I know you're doing your best to deal with outages and
> performance issues.
> Out of curiosity, do you foresee the Foundation allocating some more
> dedicated people/hardware for Labs?
We have only just added a third full-time engineer, Yuvi. My preference
going forward is to distribute labs knowledge more widely through the
Ops team so that there are /many/ more people available to help in a pinch.
We've been documenting and scripting as much as we can to facilitate
that... if everything is still falling to just the three of us a few
months from now then we can start lobbying for a fourth dedicated engineer.
Labs isn't especially constrained by hardware limitations; it's much
more a question of human bandwidth to adequately manage what hardware we
have. The foundation has been quick to fund Labs hardware requests when
we make them -- the pain is generally in transition and management
rather than actual limited financial resources. Case in point: a shiny
new pile of hard drives is the /cause/ of the outage in the subject line :)
-Andrew
>
> Il 22/02/2015 01:57, Andrew Bogott ha scritto:
>> On 2/20/15 8:07 AM, Ricordisamoa wrote:
>>> Thank you.
> I (and probably many others) would like someone from the Ops team to
> elaborate on the uptime and general reliability Labs (especially
> Tools) is supposed to have, and what kinds of services it is
> suitable for, to prevent future misunderstandings regarding loss
> of important work, etc.
>> Hello!
>>
>> I don't want to ignore your question, but I also don't exactly know
>> how to answer it. We're very unlikely to be able to project any kind
>> of future uptime percentage, because currently labs runs on few
>> enough servers that any attempt to predict uptime by multiplying
>> failure rates by server counts would produce such giant error bars as
>> to be useless.
>>
>> Nonetheless, I can recap our uptime and storage vulnerabilities so
>> that you know what to be wary of.
>>
>> Bad news:
>>
>> - Each labs instance is stored on a single server. If any one server
>> is destroyed in a catastrophe (e.g. hard-drive crash, blow from a
>> pickaxe, etc.) the state of all contained VMs will be suspended or,
>> in extreme cases, lost. [1]
>>
>> - There are three full-time Operations staff-members dedicated to
>> supporting labs. We don't cover all timezones perfectly, and
>> sometimes we take weekends and vacations. [2]
>>
>> - Although the Tools grid engine is distributed among many instances
>> (and, consequently, many physical servers), actual tools usage relies
>> on several single points of failure, the most obvious of which is the
>> web proxy. [3]
>>
>> - All of labs currently lives in a single datacenter. It's a very
>> dependable datacenter, but nonetheless vulnerable to cable cuts,
>> fires, and other local disaster scenarios. [4]
>>
>> Good news:
>>
>> - Problems like the GHOST vulnerability, which mandated a reboot of
>> all hardware in late January, are very rare.
>>
>> - The cause of the outage on Tuesday was quite bad (and quite
>> unusual), and we were nevertheless able to recover from it without
>> data loss.
>> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150217-LabsOutage
>>
>> - Yuvi has churned out a ton of great monitoring tools which mean
>> that we're ever more aware of and responsive to incidents that might
>> precede outages.
>>
>> - Use of Labs and Tools is growing like crazy! This means that the
>> Labs team is stretched a bit thin rushing to keep up, but I have a
>> hard time thinking of this as bad news.
>>
>> I'm aware that this response is entirely qualitative, and that you
>> might prefer some actual quantities and statistics. I'm not
>> reluctant to provide those, but I simply don't know where to begin.
>> If you have any specific questions that would help address your
>> particular concerns, please don't hesitate to ask.
>>
>> -Andrew
>>
>>
>>
>> [1] This is consistent with a 'cattle, not pets' design pattern. For
>> example, all tools instances are fully puppetized and any lost
>> instance can be replaced with a few minutes' work. Labs users
>> outside of the Tools project should hew closely to this design model
>> as well. This vulnerability could be partially mitigated with
>> something like https://phabricator.wikimedia.org/T90364 but that has
>> potential downsides.
>>
>> Note that data stored on shared NFS servers and in databases is
>> highly redundant and much less subject to destruction.
>>
>> [2] Potential mitigation for this is obvious, but extremely expensive :(
>>
>> [3] Theoretical mitigation for this is
>> https://phabricator.wikimedia.org/T89995, for which I would welcome a
>> Hackathon collaborator
>>
>> [4] I believe that there are plans in place for backup replication of
>> NFS and Database data to a second data center; I will let Coren and
>> Sean comment on the specifics.
>>
>> _______________________________________________
>> Labs-l mailing list
>> Labs-l at lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
>