[Labs-l] Partial (but dramatic) labs outage on Tuesday: 2015-02-24 1500UTC-1800UTC
Andrew Bogott
abogott at wikimedia.org
Sun Feb 22 00:57:36 UTC 2015
On 2/20/15 8:07 AM, Ricordisamoa wrote:
> Thank you.
> I (and probably many others) would like someone from the Ops team to
> elaborate on the uptime and general reliability Labs (especially
> Tools) is supposed to have, and on what kinds of services it is
> suitable for, so as to prevent future misunderstandings regarding
> the loss of important work, etc.
Hello!
I don't want to ignore your question, but I also don't exactly know how
to answer it. We're very unlikely to be able to project any kind of
future uptime percentage: labs currently runs on so few servers that
any attempt to predict uptime by multiplying failure rates by server
counts would produce error bars too large to be useful.
Nonetheless, I can recap our uptime and storage vulnerabilities so that
you know what to be wary of.
Bad news:
- Each labs instance is stored on a single server. If any one server is
destroyed in a catastrophe (e.g. hard-drive crash, blow from a pickaxe,
etc.) the state of all contained VMs will be suspended or, in extreme
cases, lost. [1]
- There are three full-time Operations staff-members dedicated to
supporting labs. We don't cover all timezones perfectly, and sometimes
we take weekends and vacations. [2]
- Although the Tools grid engine is distributed across many instances
(and, consequently, many physical servers), actual Tools usage relies on
several single points of failure, the most obvious of which is the web
proxy. [3]
- All of labs currently lives in a single datacenter. It's a very
dependable datacenter, but nonetheless vulnerable to cable cuts, fires,
and other local disaster scenarios. [4]
Good news:
- Problems like the Ghost vulnerability, which mandated a reboot of all
hardware in late January, are very rare.
- The cause of the outage on Tuesday was quite bad (and quite unusual),
but we were nevertheless able to recover from it without data loss.
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150217-LabsOutage
- Yuvi has churned out a ton of great monitoring tools which mean that
we're ever more aware of and responsive to incidents that might precede
outages.
- Use of Labs and Tools is growing like crazy! This means that the Labs
team is stretched a bit thin rushing to keep up, but I have a hard time
thinking of this as bad news.
I'm aware that this response is entirely qualitative, and that you might
prefer some actual quantities and statistics. I'm not reluctant to
provide those, but I simply don't know where to begin. If you have any
specific questions that would help address your particular concerns,
please don't hesitate to ask.
-Andrew
[1] This is consistent with a 'cattle, not pets' design pattern. For
example, all tools instances are fully puppetized and any lost instance
can be replaced with a few minutes' work. Labs users outside of the
Tools project should hew closely to this design model as well. This
vulnerability could be partially mitigated with something like
https://phabricator.wikimedia.org/T90364 but that has potential downsides.
Note that data stored on shared NFS servers and in databases is highly
redundant and much less subject to destruction.
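To make 'cattle, not pets' a bit more concrete, here is a purely
illustrative Python sketch of replacing a lost instance through the
OpenStack compute API. The credentials, image, flavor, and instance
names are all hypothetical, and this is not a description of our actual
tooling; the point is only that a fresh instance plus a puppet run
converges back to the same state as the instance that was lost.

# Hypothetical sketch: recreate a lost, fully puppetized instance.
# All names and credentials here are placeholders, not real values.
from novaclient import client

nova = client.Client('2',                    # compute API version
                     'username', 'password', # placeholder credentials
                     'example-project',      # placeholder tenant/project
                     'https://openstack.example.org:5000/v2.0')

# Look up a base image and instance size by (example) name.
image = nova.images.find(name='ubuntu-14.04-example')
flavor = nova.flavors.find(name='m1.small')

# Boot the replacement; on first boot puppet applies the project's
# roles, so nothing needs to be reconstructed by hand.
server = nova.servers.create(name='example-instance-01',
                             image=image.id,
                             flavor=flavor.id)
print('Rebuilding %s (%s)' % (server.name, server.id))

Anything not captured in puppet (or stored on NFS or in a database)
would not come back with the new instance, which is exactly why we
encourage the design model above.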
[2] Potential mitigation for this is obvious, but extremely expensive :(
[3] A theoretical mitigation for this is
https://phabricator.wikimedia.org/T89995, for which I would welcome a
Hackathon collaborator.
[4] I believe that there are plans in place for backup replication of
NFS and Database data to a second data center; I will let Coren and Sean
comment on the specifics.