Fwd: Another labs outage - curse of the accursed hardware failure continues - Wikitech-l

27 Feb 2015


FYI
---------- Forwarded message ----------
From: Yuvi Panda yuvipanda@gmail.com
Date: Fri, Feb 27, 2015 at 11:42 AM
Subject: Another labs outage - curse of the accursed hardware failure continues
To: Wikimedia Labs labs-l@lists.wikimedia.org
Hello!
This is a repeat of the failure that happened a few days ago. The cause
is flaky underlying hardware; andrewbogott is looking into it atm.
== Why is everything so terrible? ==
Labs instances are virtual machines that run on physical hardware.
When the underlying hardware dies, the virtual machines on it also
die. This is similar to AWS or other cloud providers. We had one spare
machine (virt1012) in case any of the machines currently in use died
and needed a lifeboat.
A week or so ago one of the machines (virt1005) died, and we migrated
things to virt1012. This week, the new machine, virt1012, has been
having issues, and that's why the outages are all so similar. So the
current instability is basically caused by *two* different
hardware-related issues happening to two different machines with
different configurations.
IT IS A CURSE!
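For anyone who wants the failure mode spelled out, here is a toy sketch of
why the outages look so similar. It is not how Labs is actually managed:
the `evacuate` and `affected_by` helpers and the instance names are made up
for illustration, and only the host names virt1005 and virt1012 come from
the message above.

```python
# Toy model, not real Labs tooling: a dict mapping each physical host
# to the list of VM instances it runs. Instance names are hypothetical.

def evacuate(placement, failed_host, spare_host):
    """Move every VM from failed_host onto spare_host and return the moved VMs."""
    moved = placement.pop(failed_host, [])
    placement.setdefault(spare_host, []).extend(moved)
    return moved


def affected_by(placement, host):
    """Return the VMs that go down if the given host fails."""
    return list(placement.get(host, []))


placement = {
    "virt1005": ["instance-a", "instance-b"],  # hypothetical instances
    "virt1012": [],                            # the spare "lifeboat" host
}

# virt1005 dies, so everything on it moves to the spare.
moved = evacuate(placement, "virt1005", "virt1012")

# The same instances now depend on virt1012, so trouble on virt1012
# hits exactly the instances that were hit the first time.
print(affected_by(placement, "virt1012"))
```

The point is only that with a single spare, the instances evacuated from
one bad host all land together on the next host to misbehave.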
== Making things better? ==
We're adding more hardware. https://phabricator.wikimedia.org/T90783
is the ticket for that.
And specifically for toollabs, it would be awesome for it to be able
to survive one virt* node being down. This is not an easy problem to
solve, but here's the tracking ticket for it:
https://phabricator.wikimedia.org/T90542
Andrew is working through his night (again) to diagnose / fix this
issue (thanks!) and we'll keep you updated as things progress. Thank
you for your patience.
--
Yuvi Panda T
http://yuvi.in/blog