[Labs-l] Partial (but dramatic) labs outage on Tuesday: 2015-02-24 1500UTC-1800UTC
Andrew Bogott
abogott at wikimedia.org
Sun Feb 22 00:57:36 UTC 2015
On 2/20/15 8:07 AM, Ricordisamoa wrote:
> Thank you.
> I (and probably many others) would like someone from the Ops team to
> elaborate on the uptime and general reliability Labs (especially
> Tools) is supposed to have, and on what kinds of services it is
> suitable for, so as to prevent future misunderstandings regarding
> the loss of important work, etc.
Hello!
I don't want to ignore your question, but I also don't exactly know how
to answer it. We're very unlikely to be able to project any kind of
future uptime percentage: labs currently runs on so few servers that
any attempt to predict uptime by multiplying failure rates by server
counts would produce error bars too large to be useful.
Nonetheless, I can recap our uptime and storage vulnerabilities so that
you know what to be wary of.
Bad news:
- Each labs instance is stored on a single server. If any one server is
destroyed in a catastrophe (e.g. hard-drive crash, blow from a pickaxe,
etc.) the state of all contained VMs will be suspended or, in extreme
cases, lost. [1]
- There are three full-time Operations staff-members dedicated to
supporting labs. We don't cover all timezones perfectly, and sometimes
we take weekends and vacations. [2]
- Although the Tools grid engine is distributed across many instances
(and, consequently, many physical servers), actual Tools usage relies on
several single points of failure, the most obvious of which is the web
proxy. [3]
- All of labs currently lives in a single datacenter. It's a very
dependable datacenter, but nonetheless vulnerable to cable cuts, fires,
and other local disaster scenarios. [4]
Good news:
- Problems like the Ghost vulnerability, which mandated a reboot of all
hardware in late January, are very rare.
- The cause of the outage on Tuesday was quite bad (and quite unusual),
but we were nevertheless able to recover from it without data loss.
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150217-LabsOutage
- Yuvi has churned out a ton of great monitoring tools which mean that
we're ever more aware of and responsive to incidents that might precede
outages.
- Use of Labs and Tools is growing like crazy! This means that the Labs
team is stretched a bit thin rushing to keep up, but I have a hard time
thinking of this as bad news.
I'm aware that this response is entirely qualitative, and that you might
prefer some actual quantities and statistics. I'm not reluctant to
provide those, but I simply don't know where to begin. If you have any
specific questions that would help address your particular concerns,
please don't hesitate to ask.
-Andrew
[1] This is consistent with a 'cattle, not pets' design pattern. For
example, all tools instances are fully puppetized and any lost instance
can be replaced with a few minutes' work. Labs users outside of the
Tools project should hew closely to this design model as well. This
vulnerability could be partially mitigated with something like
https://phabricator.wikimedia.org/T90364 but that has potential downsides.
Note that data stored on shared NFS servers and in databases is highly
redundant and much less subject to destruction.
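To make 'cattle, not pets' a bit more concrete, here is a purely
illustrative Python sketch of replacing a lost instance through the
OpenStack compute API. The credentials, image, flavor, and instance
names are all hypothetical, and this is not a description of our actual
tooling; the point is only that a fresh instance plus a puppet run
converges back to the same state as the instance that was lost.

# Hypothetical sketch: recreate a lost, fully puppetized instance.
# All names and credentials here are placeholders, not real values.
from novaclient import client

nova = client.Client('2',                    # compute API version
                     'username', 'password', # placeholder credentials
                     'example-project',      # placeholder tenant/project
                     'https://openstack.example.org:5000/v2.0')

# Look up a base image and instance size by (example) name.
image = nova.images.find(name='ubuntu-14.04-example')
flavor = nova.flavors.find(name='m1.small')

# Boot the replacement; on first boot puppet applies the project's
# roles, so nothing needs to be reconstructed by hand.
server = nova.servers.create(name='example-instance-01',
                             image=image.id,
                             flavor=flavor.id)
print('Rebuilding %s (%s)' % (server.name, server.id))

Anything not captured in puppet (or stored on NFS or in a database)
would not come back with the new instance, which is exactly why we
encourage the design model above.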
[2] Potential mitigation for this is obvious, but extremely expensive :(
[3] A theoretical mitigation for this is
https://phabricator.wikimedia.org/T89995, for which I would welcome a
Hackathon collaborator.
[4] I believe that there are plans in place for backup replication of
NFS and Database data to a second data center; I will let Coren and Sean
comment on the specifics.