[Labs-l] Partial (but dramatic) labs outage on Tuesday: 2015-02-24 1500UTC-1800UTC
Ricordisamoa
ricordisamoa at openmailbox.org
Sun Feb 22 01:34:16 UTC 2015
Thanks! I know you're doing your best to deal with outages and
performance issues.
Out of curiosity, do you foresee the Foundation allocating some more
dedicated people/hardware for Labs?
On 22/02/2015 01:57, Andrew Bogott wrote:
> On 2/20/15 8:07 AM, Ricordisamoa wrote:
>> Thank you.
>> I (and probably many others) would like someone from the Ops team to
>> elaborate on the uptime and general reliability Labs (especially
>> Tools) is supposed to have, and on what kinds of services it is
>> suitable for, to prevent future misunderstandings regarding loss of
>> important work, etc.
> Hello!
>
> I don't want to ignore your question, but I also don't exactly know
> how to answer it. We're very unlikely to be able to project any kind
> of future uptime percentage, because labs currently runs on so few
> servers that any attempt to predict uptime by multiplying per-server
> failure rates by server counts would produce error bars too large to
> be useful.
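> 
> As a purely illustrative sketch of that point (with made-up numbers,
> not our real failure data): given only a handful of server-months of
> observations, the confidence interval on a per-server failure rate is
> enormous, and so is any fleet-wide uptime projection built on top of
> it.
> 
>     import math
> 
>     def wilson_interval(failures, n, z=1.96):
>         # 95% Wilson score interval for a per-server failure
>         # probability estimated from n server-months of data.
>         p = failures / float(n)
>         denom = 1 + z**2 / n
>         center = (p + z**2 / (2 * n)) / denom
>         half = (z / denom) * math.sqrt(
>             p * (1 - p) / n + z**2 / (4 * n**2))
>         return max(0.0, center - half), min(1.0, center + half)
> 
>     # Hypothetical example: 20 servers watched for one month, with
>     # one host-level failure observed in that window.
>     low, high = wilson_interval(failures=1, n=20)
>     print("monthly failure rate between {:.1%} and {:.1%}".format(
>         low, high))
>     # -> roughly 0.9% .. 23.6%; the two ends differ by more than an
>     #    order of magnitude, and so would any uptime number derived
>     #    from them.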
>
> Nonetheless, I can recap our uptime and storage vulnerabilities so
> that you know what to be wary of.
>
> Bad news:
>
> - Each labs instance is stored on a single server. If any one server
> is destroyed in a catastrophe (e.g. hard-drive crash, blow from a
> pickaxe, etc.) then all of the VMs it hosts will be suspended or, in
> extreme cases, their state will be lost. [1]
>
> - There are three full-time Operations staff members dedicated to
> supporting labs. We don't cover all timezones perfectly, and
> sometimes we take weekends and vacations. [2]
>
> - Although the Tools grid engine is distributed among many instances
> (and, consequently, many physical servers), actual tools usage relies
> on several single points of failure, the most obvious of which is the
> web proxy. [3]
>
> - All of labs currently lives in a single datacenter. It's a very
> dependable datacenter, but nonetheless vulnerable to cable cuts,
> fires, and other local disaster scenarios. [4]
>
> Good news:
>
> - Problems like the Ghost vulnerability, which mandated a reboot of
> all hardware in late January, are very rare.
>
> - The cause of the outage on Tuesday was quite bad (and quite
> unusual), but we were nevertheless able to recover from it without
> data loss.
> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150217-LabsOutage
>
> - Yuvi has churned out a ton of great monitoring tools which mean that
> we're ever more aware of and responsive to incidents that might
> precede outages.
>
> - Use of Labs and Tools is growing like crazy! This means that the
> Labs team is stretched a bit thin rushing to keep up, but I have a
> hard time thinking of this as bad news.
>
> I'm aware that this response is entirely qualitative, and that you
> might prefer some actual quantities and statistics. I'm not reluctant
> to provide those, but I simply don't know where to begin. If you have
> any specific questions that would help address your particular
> concerns, please don't hesitate to ask.
>
> -Andrew
>
>
>
> [1] This is consistent with a 'cattle, not pets' design pattern. For
> example, all tools instances are fully puppetized and any lost
> instance can be replaced with a few minutes' work. Labs users outside
> of the Tools project should hew closely to this design model as well.
> This vulnerability could be partially mitigated with something like
> https://phabricator.wikimedia.org/T90364 but that has potential
> downsides.
>
> Note that data stored on shared NFS servers and in Databases is highly
> redundant and much less subject to destruction.
>
> [2] Potential mitigation for this is obvious, but extremely expensive :(
>
> [3] Theoretical mitigation for this is
> https://phabricator.wikimedia.org/T89995, for which I would welcome a
> Hackathon collaborator.
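> 
> For the curious, the general shape of that mitigation is simply
> running more than one proxy and failing over when the active one
> stops answering. Here is a toy sketch of the health-check half of
> that idea, with hypothetical hostnames and no claim about what T89995
> will actually end up specifying:
> 
>     import socket
> 
>     # Illustrative placeholders only; the real proxy layout is not
>     # described here.
>     ACTIVE = "proxy-active.example.invalid"
>     STANDBY = "proxy-standby.example.invalid"
> 
>     def is_up(host, port=80, timeout=2.0):
>         # True if the host accepts a TCP connection on the port.
>         try:
>             sock = socket.create_connection((host, port), timeout)
>             sock.close()
>             return True
>         except (socket.error, socket.timeout):
>             return False
> 
>     if not is_up(ACTIVE) and is_up(STANDBY):
>         # In a real setup this step would move a floating IP or
>         # update DNS; here it is only a log line.
>         print("failing over from %s to %s" % (ACTIVE, STANDBY))
> 
> The hard part, of course, is the actual repointing step (floating IP,
> DNS, or otherwise), which the sketch deliberately leaves out.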
>
> [4] I believe that there are plans in place for backup replication of
> NFS and Database data to a second data center; I will let Coren and
> Sean comment on the specifics.
>