[Engineering] Data center switch-over moving ahead next week: please stay available :)

Mark Bergsma mark at wikimedia.org
Tue Apr 19 16:00:33 UTC 2016


Hi all,

Today's data center switch-over began as planned and has just completed
successfully. We are now serving our sites from codfw (Dallas, Texas), and
will continue to do so for the next 2 days if all stays well.

We switched the wikis to read-only (editing disabled) at 14:02 UTC, and
went back to read-write at 14:48 UTC - a little longer than planned.
Although editing was possible from then on, Special:RecentChanges (and
related change feeds) did not work until 15:10 UTC, when we found and fixed
an unexpected configuration problem with our Redis servers. The site has
stayed up and available for readers throughout the entire migration.

Overall the procedure was a success, with few problems along the way.
We have also carefully kept track of the issues and delays we encountered,
so that we can evaluate them to improve and speed up the procedure and
reduce the impact on our users. Some of these improvements will already be
implemented for our switch back on Thursday.

We're still expecting to find (possibly subtle) issues today, and would
like everyone who notices anything to use the following channels to report
them:

1. File a Phabricator issue with project #codfw-rollout
2. Report issues on IRC: Freenode channel #wikimedia-tech (if urgent)
3. Send an e-mail to the Operations list: ops at lists.wikimedia.org

We're not done yet, but thanks to all who have helped so far. :-)

Mark


On Fri, Apr 15, 2016 at 2:56 PM, Mark Bergsma <mark at wikimedia.org> wrote:

> Hi all,
>
> As previously announced[1], our data center switch-over test is planned to
> happen next week, on Tuesday April 19th, with the switch back two days
> later, Thursday April 21st. It's looking good, and unless any major new
> obstacles arise, we'll be moving forward with it.
>
> A request:
>
> The Technology team would highly appreciate it if everyone in Engineering
> with knowledge of or responsible for any software/service/extension running
> in production, could keep an eye on things during these 3 days, and also
> stay reachable by phone just in case of need.
>
>
> With such a large migration with lots of components it's always possible
> that we find unanticipated issues during or after the fail-overs. Certain
> features/services could fail unexpectedly, sometimes subtly so, e.g. due to
> unforeseen traffic patterns from configuration mistakes, ACL/permission
> mismatches, etc. We'll certainly try to correct these issues, but in
> some cases we may need or benefit from your support/knowledge/patches, and
> may want to reach you. Because we can't detect all issues immediately, it
> would also be helpful if you'd keep an eye out for any site issues and
> report any regressions.
>
> If we need to reach you urgently by phone, we'll typically do so using the
> phone number provided on Office Wiki's Contact List[2]. Be aware that some
> phone numbers in this list have been corrupted in the past during automated
> edits, or may be outdated. Therefore, please check if your phone number
> listed there is still correct.
>
> The actual switch-overs begin on Tuesday, 19 April at 14:00 UTC and
> Thursday, 21 April at 14:00 UTC, respectively. Any changes to this schedule
> will be noted on our Wikitech calendar[3].
>
> To report any issues, please use one of the following channels:
>
> 1. File a Phabricator issue with project #codfw-rollout
> 2. Report issues on IRC: Freenode channel #wikimedia-tech (if urgent, or
> during the migration)
> 3. Send an e-mail to the Operations list: ops at lists.wikimedia.org (any
> time)
>
> Thanks!
>
> [1] http://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/
> [2] https://office.wikimedia.org/wiki/Contact_list
> [3] https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Schedule_for_Q3_FY2015-2016_rollout
>
> --
> Mark Bergsma <mark at wikimedia.org>
> Lead Operations Architect
> Director of Technical Operations
> Wikimedia Foundation
>



-- 
Mark Bergsma <mark at wikimedia.org>
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation