[Labs-l] IMPORTANT: Many instances slated for reboot and downtime this weekend
Andrew Bogott
abogott at wikimedia.org
Tue Sep 16 15:45:20 UTC 2014
-- Executive Summary:
Many instances will be rebooted at some point this weekend or next
week. The total list of instances subject to reboot is here:
https://wikitech.wikimedia.org/wiki/Virt1006_rebuild
Tools and Beta users can ignore this email.
-- The full story:
Sorry about sending two different IMPORTANT emails this week; we
generally try to keep labs crises to a minimum. Indeed, this email is
about avoiding a potential crisis.
The labs server known as 'virt1006' has been acting poorly lately.
Several times in the last month we've seen instances that live on
virt1006 get into inconsistent states during reboot: they reboot and
never come back up, or they stay in a perpetual 'rebooting' state.
So far we've been able to rescue such instances, but the misbehavior of
a Labs server is very disconcerting. Rather than wait for a full
collapse (and resulting sudden death of 50+ VMs) we've decided to
migrate all instances off of virt1006 and then either rebuild
the system or discard the hardware. Moving an instance off of a server
is fairly painless, but it does require a few minutes of downtime and a
reboot.
I've spoken to a few of you directly about the reboots; the affected
Tools and Deployment-prep instances have already been handled. There are
a lot more to go, though. If your instance is stable, has its init
scripts set up properly, and can take a reboot in stride, then
congratulations! Otherwise, please take whatever steps you need
to batten down the hatches and get ready for a reboot.
If you need the reboot to happen at a scheduled time while you are
standing by, that's totally fine. In that case please schedule a reboot
window on this page:
https://wikitech.wikimedia.org/wiki/Virt1006_rebuild
Thanks for your cooperation.
-Andrew