FYI
---------- Forwarded message ----------
From: Yuvi Panda <yuvipanda@gmail.com>
Date: Fri, Feb 27, 2015 at 11:42 AM
Subject: Another labs outage - curse of the accursed hardware failure continues
To: Wikimedia Labs <labs-l@lists.wikimedia.org>
Hello!
This is a repeat of the failure that happened a few days ago: the underlying hardware is flaky. andrewbogott is looking into it at the moment.
== Why is everything so terrible? ==
Labs instances are Virtual Machines that run on physical hardware. When the underlying hardware dies, the virtual machines on it also die. This is similar to AWS or other cloud providers. We had one spare machine (virt1012) in case any of the machines currently in use died and needed a lifeboat.
A week or so ago one of the machines (virt1005) died, and we migrated things to virt1012. This week, the new machine, virt1012, has been having issues, and that's why the outages are all so similar. So the current instability is basically caused by *two* different hardware-related issues happening to two different machines with different configurations.
IT IS A CURSE!
== Making things better? ==
We're adding more hardware. https://phabricator.wikimedia.org/T90783 is the ticket for that.
And specifically for toollabs, it would be awesome for it to be able to survive one virt* node being down. This is not an easy problem to solve, but here's the tracking ticket for it: https://phabricator.wikimedia.org/T90542
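To make "survive one virt* node being down" a bit more concrete, here is a rough sketch of the kind of check involved: given where instances are placed, find the roles that live on only a single host and would therefore vanish if that host died. Everything in it is an assumption for illustration (the function name, the instance names, the role names, and the host "virt-other"); it is not how Labs or toollabs actually tracks placement.

```python
from collections import defaultdict


def roles_lost_if_host_fails(placement):
    """Map each host to the roles that would have no instance left if it died.

    `placement` is an iterable of (instance, role, host) tuples.
    Hosts whose failure would not wipe out any role are omitted.
    """
    # Collect the set of hosts that carry at least one instance of each role.
    hosts_by_role = defaultdict(set)
    for _instance, role, host in placement:
        hosts_by_role[role].add(host)

    # A role hosted on exactly one machine is a single point of failure.
    at_risk = defaultdict(list)
    for role, hosts in hosts_by_role.items():
        if len(hosts) == 1:
            (only_host,) = hosts
            at_risk[only_host].append(role)

    return {host: sorted(roles) for host, roles in at_risk.items()}


# Hypothetical placement: instance names, roles and "virt-other" are made up.
placement = [
    ("tools-webproxy-01", "webproxy", "virt1012"),
    ("tools-exec-01", "exec", "virt1012"),
    ("tools-exec-02", "exec", "virt-other"),
    ("tools-master-01", "master", "virt-other"),
]

print(roles_lost_if_host_fails(placement))
```

In this made-up layout the "exec" role is spread over two hosts and survives either one failing, while "webproxy" and "master" each sit on a single host. The hard part, which this sketch ignores, is making each service actually work with redundant instances.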
Andrew is working through his night (again) to diagnose / fix this issue (thanks!) and we'll keep you updated as things progress. Thank you for your patience.
-- Yuvi Panda T http://yuvi.in/blog
wikitech-l@lists.wikimedia.org