On Thu, Sep 13, 2018 at 7:49 AM Bryan Davis bd808@wikimedia.org wrote:
Everyone involved worked hard to make this happen, but I'd like to give a special shout out to Giuseppe Lavagetto for taking the time to follow up on a VisualEditor problem that affected Wikitech (https://phabricator.wikimedia.org/T163438). We noticed during the April 2017 switchover that the client side code for VE was failing to communicate with the backend component while the wikis were being served from the Dallas datacenter. We guessed that this was a configuration error of some sort, but did not take the time to debug in depth. When the issue reoccurred during the current datacenter switch, Giuseppe took a deep dive into the code and configuration, identified the configuration difference that triggered the problem, and made a patch for the Parsoid backend that fixes Wikitech.
While I'm flattered by the compliments, I think it's fair to underline that the problem was partly caused by a patch I made to Parsoid some time ago. So I mostly cleaned up a problem I caused myself - does that count for getting a new t-shirt, even if I fixed it with more than a year's delay? :P
On the other hand, I want to join the choir praising the work that has been done for the switchover, and take the time to list all the things we've done collectively to make it as uneventful and fast as it was (read-only time was less than 8 minutes this time):
- MediaWiki now fetches its read-only state and which datacenter is the master from etcd, eliminating the need for a code deployment (a rough sketch of the idea follows this list)
- We now connect to our per-datacenter distributed cache via mcrouter, which allows us to keep the caches in the various datacenters consistent. This eliminated the need to wipe the cache during the read-only phase, resulting in a big reduction in the time we spent in read-only
- Our old jobqueue not only gave me innumerable debugging nightmares, but was also hard and tricky to handle in a multi-datacenter environment. We have replaced it with a more modern system which needed no intervention during the switchover
- Our media storage system (Swift + Thumbor) is now active-active, and we read from and write to both datacenters
- We created a framework, called "spicerack", for easily automating complex orchestration tasks (like a switchover). It will benefit our operations in general and has the potential to reduce toil on the SRE team, as proven, automated procedures can be coded for most events
- Last but not least, the Dallas datacenter (codenamed "codfw") needed little to no tuning when we moved all traffic to it, and virtually nothing had gone out of sync over the last 1.4 years that we needed to fix. I know this might sound unimpressive, but keeping a datacenter that's not really in use in good shape and in sync is a huge accomplishment in itself; I've never before seen such a show of flawless execution and collective discipline.
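To make the first point a bit more concrete, here is a rough, hypothetical sketch of what fetching switchover state from etcd can look like. The endpoint and key names are made up for illustration only, and MediaWiki's actual implementation is of course different (and far more robust):

    # Hypothetical sketch only: fetch switchover state from etcd over its
    # v2-style HTTP keys API. Endpoint and key names are illustrative,
    # not the ones used in Wikimedia production.
    import requests

    ETCD = "http://etcd.example.org:2379"

    def get_key(key):
        # An etcd v2 GET returns JSON like {"node": {"key": ..., "value": ...}}
        resp = requests.get(f"{ETCD}/v2/keys/{key}", timeout=2)
        resp.raise_for_status()
        return resp.json()["node"]["value"]

    master_dc = get_key("config/master-datacenter")    # e.g. "eqiad" or "codfw"
    read_only = get_key("config/read-only") == "true"  # flipped during the switchover

    print(f"master datacenter: {master_dc}, read-only: {read_only}")

Because the application re-reads these keys at runtime, changing the master datacenter or the read-only flag becomes a configuration change in etcd rather than a code deployment, which is a big part of why the read-only window was so short.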
So I want to congratulate everyone who was involved in the process, which includes most of the people on the core platform, performance, search and SRE teams, but a special personal thanks goes to:
- The whole SRE team, and really anyone working on our production environment, for keeping the Dallas datacenter in good shape for more than a year, so that we barely needed to adjust anything pre- or post-switchover
- Alexandros and Riccardo, for driving most of the process and allowing me to worry about the switchover for less than a week before it happened and, yes, to take the time to fix that bug too :)
Cheers,
Giuseppe

P.S. I'm sure I forgot someone / something amazing we've done; I apologize in advance.