[Labs-l] Production status of labs (was Re: Reboot of virt11 Friday Sept 6 at 20:00 UTC)

Sun Sep 8 21:26:52 UTC 2013

On Sun, Sep 8, 2013 at 9:50 AM, Merlijn van Deen <valhallasw at arctus.nl> wrote:

> On 8 September 2013 00:18, Marc A. Pelletier <marc at uberbox.org> wrote:
>
>> You're missing the point of the distinction between "production",
>>  "semi-production" and "not production" -- the difference is exactly how
>> many 9's of availability we are gunning for, and how many people are
>> woken up by an outage to ensure that.
>>
>
> I think Maarten is well aware of this difference. On the contrary: the
> discussion started with Maarten asking the following:
>
>  > How long will the downtime be and can you please announce earlier? A
> week is a normal notice time.
> > The Wiki Loves Monuments tools and applications (like the mobile app)
> rely on this so please keep it as short as possible.
>
> That is not asking for 99.999% uptime. It's asking for the same heads-up
> time we had - and still have! - for the Toolserver. It's asking for more
> than one evening to prepare for possible downtime: the downtime was
> announced less than 48 hours in advance, around midnight CEST.
>
> That being said, I would like to repeat Maarten's question: can we
> *please* get information on what is going to happen, when, and how long the
> expected downtime is, preferably about a week in advance?
>
>
The maintenance was announced a full three days in advance. I gave a list
of affected instances and Coren gave information about how tools was going
to be affected. I'll give a week of notice if possible, but I don't believe
there was anything wrong with the information provided.

The problem with notice in this specific situation was that the host needed
a reboot because it was having IO issues related to the system not
releasing some mounted filesystems. So, there was a tradeoff between giving
longer notice and having performance issues for all instances on the host,
or giving less notice and fixing the performance issues. I chose to fix
performance problems since it was going to cause a relatively short amount
of downtime on a service that is not to be depended on for production services.

- Ryan