On Thu, Sep 13, 2018 at 7:49 AM Bryan Davis bd808@wikimedia.org wrote:
Everyone involved worked hard to make this happen, but I'd like to give a special shout out to Giuseppe Lavagetto for taking the time to follow up on a VisualEditor problem that affected Wikitech (https://phabricator.wikimedia.org/T163438). We noticed during the April 2017 switchover that the client side code for VE was failing to communicate with the backend component while the wikis were being served from the Dallas datacenter. We guessed that this was a configuration error of some sort, but did not take the time to debug in depth. When the issue reoccurred during the current datacenter switch, Giuseppe took a deep dive into the code and configuration, identified the configuration difference that triggered the problem, and made a patch for the Parsoid backend that fixes Wikitech.
While I'm flattered by the compliments, I think it's fair to underline that the problem was partly caused by a patch I made to Parsoid some time ago. So I mostly cleaned up a problem I caused myself - does that count for getting a new t-shirt, even if I fixed it with more than a year's delay? :P
On the other hand, I want to join the choir praising the work that has been done for the switchover, and take the time to list all the things we've done collectively to make it as uneventful and fast as it was (read-only time was less than 8 minutes this time):
- MediaWiki now fetches its read-only state and which datacenter is the master from etcd, eliminating the need for a code deployment (a rough sketch of the idea follows this list)
- We now connect to our per-datacenter distributed cache via mcrouter, which allows us to keep the caches in the various datacenters consistent. This eliminated the need to wipe the cache during the read-only phase, resulting in a big reduction in the time we spent in read-only
- Our old jobqueue not only gave me innumerable debugging nightmares, but was also hard and tricky to handle in a multi-datacenter environment. We have replaced it with a more modern system which needed no intervention during the switchover
- Our media storage system (Swift + Thumbor) is now active-active, and we read from and write to both datacenters
- We created a framework, called "spicerack", for easily automating complex orchestration tasks (like a switchover). It will benefit our operations in general and has the potential to reduce toil on the SRE team, as proven, automated procedures can be coded for most events
- Last but not least, the Dallas datacenter (codenamed "codfw") needed little to no tuning when we moved all traffic to it, and virtually nothing had gone out of sync over the last 1.4 years that we needed to fix. I know this might sound unimpressive, but keeping a datacenter that's not really in use in good shape and in sync is a huge accomplishment in itself; I've never before seen such a show of flawless execution and collective discipline.
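To make the first point a bit more concrete, here is a rough, hypothetical sketch of what fetching switchover state from etcd can look like. The endpoint and key names are made up for illustration only, and MediaWiki's actual implementation is of course different (and far more robust):

    # Hypothetical sketch only: fetch switchover state from etcd over its
    # v2-style HTTP keys API. Endpoint and key names are illustrative,
    # not the ones used in Wikimedia production.
    import requests

    ETCD = "http://etcd.example.org:2379"

    def get_key(key):
        # An etcd v2 GET returns JSON like {"node": {"key": ..., "value": ...}}
        resp = requests.get(f"{ETCD}/v2/keys/{key}", timeout=2)
        resp.raise_for_status()
        return resp.json()["node"]["value"]

    master_dc = get_key("config/master-datacenter")    # e.g. "eqiad" or "codfw"
    read_only = get_key("config/read-only") == "true"  # flipped during the switchover

    print(f"master datacenter: {master_dc}, read-only: {read_only}")

Because the application re-reads these keys at runtime, changing the master datacenter or the read-only flag becomes a configuration change in etcd rather than a code deployment, which is a big part of why the read-only window was so short.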
So I want to congratulate everyone who was involved in the process, which includes most of the people on the core platform, performance, search and SRE teams, but a special personal thanks goes to:
- The whole SRE team, and really anyone working on our production environment, for keeping the Dallas datacenter in good shape for more than a year, so that we barely needed to adjust anything pre- or post-switchover
- Alexandros and Riccardo, for driving most of the process and allowing me to worry about the switchover for less than a week before it happened and, yes, to take the time to fix that bug too :)
Cheers,
Giuseppe

P.S. I'm sure I forgot someone / something amazing we've done; I apologize in advance.