[Labs-l] IMPORTANT: Many instances slated for reboot and downtime this weekend (finished)
Andrew Bogott
abogott at wikimedia.org
Sat Sep 20 20:39:03 UTC 2014
This is done now, and all the affected instances are up and running.
-Andrew
> On 9/16/14 10:45 AM, Andrew Bogott wrote:
>> -- Executive Summary:
>>
>> Many instances will be rebooted at some point this weekend or next
>> week. The total list of instances subject to reboot is here:
>>
>> https://wikitech.wikimedia.org/wiki/Virt1006_rebuild
>>
>> Tools and Beta users can ignore this email.
>>
>>
>> -- The full story:
>>
>> Sorry about sending two different IMPORTANT emails this week; we
>> generally try to keep labs crises to a minimum. Indeed, this email
>> is about avoiding a potential crisis.
>>
>> The labs server known as 'virt1006' has been acting poorly lately.
>> Several times in the last month we've seen instances that live on
>> virt1006 get into inconsistent states during reboot... they reboot
>> and never come back up, or they stay in a perpetual 'rebooting' state.
>>
>> So far we've been able to rescue such instances, but the misbehavior
>> of a Labs server is very disconcerting. Rather than wait for a full
>> collapse (and resulting sudden death of 50+ VMs) we've decided to
>> migrate all instances off of virt1006 and then either
>> rebuild the system or discard the hardware. Moving an instance off of
>> a server is fairly painless, but it does require a few minutes of
>> downtime and a reboot.
>>
>> I've spoken to a few of you directly about the reboots; the affected
>> Tools and Deployment-prep instances have already been handled. There
>> are a lot more to go, though. If your instance is stable, has its
>> init scripts set up properly, and can tolerate a reboot, then
>> congratulations! Otherwise, please take whatever steps you need to
>> take to batten down the hatches and get ready for a reboot.
>>
>> If you need the reboot to happen at a scheduled time while you are
>> standing by, that's totally fine. In that case please schedule a
>> reboot window on this page:
>>
>> https://wikitech.wikimedia.org/wiki/Virt1006_rebuild
>>
>> Thanks for your cooperation.
>>
>> -Andrew
>