Hello all,
Today we've successfully migrated our wikis (MediaWiki and associated services) from our primary data center (eqiad) to our secondary (codfw), an exercise we've now run for the third year in a row. During the most critical part of today's switch, the wikis were in read-only mode for seven and a half minutes, a significant improvement over last year.
Although the switchover process itself has been largely automated and went pretty smoothly once started, we did experience some issues leading up to our maintenance window, which caused us to delay the switch somewhat:
- In the days before the switch, a performance issue had been discovered in the Translate extension as used by CentralNotice, which was expected to cause database stampedes during the switch. We decided to mitigate this by temporarily disabling the extension for the duration of the switchover process. However, it is now understood that this may have caused some unwanted side effects, and in the future it should be avoided in favor of other mitigations.
- Right before the switchover commenced, an eqiad Varnish server misbehaved, causing a spike of failed requests. Thankfully, the SRE Traffic team identified and addressed the issue promptly, allowing the switchover to proceed.
- Two codfw s7 database slaves crashed right before the start of our maintenance window. This delayed the start of the switchover procedure by approximately 30 minutes while we investigated the cause and impact.
- The ElasticSearch search cluster's traffic did not follow MediaWiki traffic from eqiad to codfw during the switch as expected, but stayed in our primary data center instead. Investigation showed that ElasticSearch had been manually hardcoded to eqiad in its configuration. After the switchover was complete, this was rectified with a configuration change and a manual switch to codfw.
- After the switchover completed, we experienced repeated database load spikes, primarily on the codfw s1 cluster (serving English Wikipedia). The DBA team performed a series of fine-tuning and other corrective actions.
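The ElasticSearch item above illustrates a classic failure mode: a service endpoint pinned in configuration rather than derived from the currently active datacenter. A minimal sketch of the pattern follows; the hostname template, parameter names, and the `override` mechanism are all invented for illustration and are not Wikimedia's actual configuration:

```python
# Hypothetical sketch of datacenter-aware endpoint selection. The hostname
# pattern and the "override" knob are invented for this example; they are
# not the real ElasticSearch or MediaWiki configuration.

SERVICE_TEMPLATE = "search.svc.{dc}.example.net"

def search_endpoint(active_dc, override=None):
    """Return the search endpoint for the currently active datacenter.

    A hardcoded override (like the manual eqiad pin found during the
    switchover) silently wins over the computed value, which is exactly
    how traffic can fail to follow a datacenter switch.
    """
    if override is not None:
        return override
    return SERVICE_TEMPLATE.format(dc=active_dc)

# The active datacenter moves to codfw, but a pinned config stays on eqiad:
pinned = search_endpoint("codfw", override="search.svc.eqiad.example.net")
# Removing the hardcoded value lets the endpoint follow the switchover:
fixed = search_endpoint("codfw")
```

The point of the sketch is that the bug is invisible in normal operation: the pinned value and the computed value agree as long as eqiad is the master, and only diverge on the day of the switch.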
All wikis are now served from our secondary codfw data center, and this is expected to stay that way for the next four weeks, after which we will reverse the procedure.
Should you experience any issue that seems related to the switchover process, please feel free to file a ticket in Phabricator and tag it with the Datacenter-Switchover-2018 project tag[1]. We will monitor this tag closely and keep all such issues updated.
We'd like to thank everyone for their hard work in ensuring that any (potential) issues were resolved in a timely manner, for automating the process wherever possible, and for making this datacenter switch a success!
[1] https://phabricator.wikimedia.org/project/profile/3571/
On 2018-09-12 20:16 +0300, Alexandros Kosiaris wrote:
> Today we've successfully migrated our wikis (MediaWiki and associated services) from our primary data center (eqiad) to our secondary (codfw) [...]
Well done, all.
Greg
Seriously -- this is some complicated, difficult stuff that one day may be critical to keeping our projects available to everyone (but let's hope not)
Well done indeed!
-Toby
Wmfall mailing list Wmfall@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wmfall
+1
Great, steady progress!
Best,
Victoria
On Wed, Sep 12, 2018 at 11:16 AM, Alexandros Kosiaris <akosiaris@wikimedia.org> wrote:
> Today we've successfully migrated our wikis (MediaWiki and associated services) from our primary data center (eqiad) to our secondary (codfw) [...]
Everyone involved worked hard to make this happen, but I'd like to give a special shout out to Giuseppe Lavagetto for taking the time to follow up on a VisualEditor problem that affected Wikitech (https://phabricator.wikimedia.org/T163438). We noticed during the April 2017 switchover that the client side code for VE was failing to communicate with the backend component while the wikis were being served from the Dallas datacenter. We guessed that this was a configuration error of some sort, but did not take the time to debug in depth. When the issue reoccurred during the current datacenter switch, Giuseppe took a deep dive into the code and configuration, identified the configuration difference that triggered the problem, and made a patch for the Parsoid backend that fixes Wikitech.
Wikitech is a low-volume wiki for both edits and reads, and for various historical and technical reasons is different from all the other wikis that we host. Keeping it available for reading is important to our technical teams because it hosts many of the troubleshooting playbooks that we use to diagnose and correct operational problems on the rest of the wikis. Taking the time to work on an editing bug that only impacted edits done using VisualEditor is awesome, but not the sort of thing I would normally expect to be worked on promptly. For me, Giuseppe's work on this bug is a sign that he cares about the small details, and also that the rest of the switchover went well, giving him the time to investigate lower-impact edge cases like this.
Bryan
Congratulations, awesome work!
Thank you Bryan and thank you Giuseppe. It is terrific to hear of such good work and even better to have it celebrated! Proud of you both!
Victoria
On Thu, Sep 13, 2018 at 7:49 AM Bryan Davis <bd808@wikimedia.org> wrote:
> Everyone involved worked hard to make this happen, but I'd like to give a special shout out to Giuseppe Lavagetto for taking the time to follow up on a VisualEditor problem that affected Wikitech (https://phabricator.wikimedia.org/T163438). [...]
While I'm flattered by the compliments, I think it's fair to underline that the problem was partly caused by a patch I made to Parsoid some time ago. So I mostly cleaned up a problem I had caused - does this count toward getting a new t-shirt, even if I fixed it with more than a year of delay? :P
On the other hand, I want to join the choir praising the work that has been done for the switchover, and take the time to list all the things we've done collectively to make it as uneventful and fast (read-only time was less than 8 minutes this time) as it was:
- MediaWiki now fetches its read-only state, and which datacenter is the master, from etcd, eliminating the need for a code deployment.
- We now connect to our per-datacenter distributed cache via mcrouter, which allows us to keep the caches in the various datacenters consistent. This eliminated the need to wipe the cache during the read-only phase, resulting in a big reduction of the time we spent read-only.
- Our old jobqueue not only gave me innumerable debugging nightmares, but was hard and tricky to handle in a multi-datacenter environment. We have replaced it with a more modern system which needed no intervention during the switchover.
- Our media storage system (Swift + Thumbor) is now active-active, and we read from and write to both datacenters.
- We created a framework called "spicerack" for easily automating complex orchestration tasks (like a switchover), which will benefit our operations in general and has the potential to reduce toil on the SRE team, as proven, automated procedures can be coded for most events.
- Last but not least, the Dallas datacenter (codenamed "codfw") needed little to no tuning when we moved all traffic to it, and we had to fix virtually nothing that had gone out of sync during the last 1.4 years. I know this might sound unimpressive, but keeping a datacenter that's not actively used in good shape and in sync is a huge accomplishment in itself; I've never before seen such a show of flawless execution and collective discipline.
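As a rough illustration of the first point, here is a minimal sketch of how a read-only flag and master-datacenter setting fetched from etcd might be decoded. The payload shape, field names, and defaults are invented for this example; they are not MediaWiki's actual configuration schema:

```python
import json

# Hypothetical payload an operator might write to etcd during a switchover.
# Field names and defaults are illustrative, not the production schema.
DEFAULTS = {"masterDatacenter": "eqiad", "readOnly": False, "reason": ""}

def parse_switchover_state(raw):
    """Decode a JSON payload fetched from etcd and fill in defaults,
    so that a missing field falls back to a safe known value."""
    state = dict(DEFAULTS)
    state.update(json.loads(raw.decode("utf-8")))
    return state

def serve_read_only(state):
    """The application would consult this on each request: flipping the
    flag in etcd changes behavior everywhere, with no code deployment."""
    return bool(state["readOnly"])

# During the read-only phase of the switch, etcd might hold:
payload = b'{"masterDatacenter": "codfw", "readOnly": true, "reason": "switchover"}'
state = parse_switchover_state(payload)
```

The design win described in the message is exactly this: the switchover becomes a small data write to etcd rather than a coordinated deploy of new configuration to every application server.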
So I want to congratulate everyone who was involved in the process; that includes most of the people on the core platform, performance, search and SRE teams. A special personal thanks goes to:
- The whole SRE team, and really anyone working on our production environment, for keeping the Dallas datacenter in good shape for more than a year, so that we didn't need to adjust almost anything pre- or post-switchover
- Alexandros and Riccardo for driving most of the process and allowing me to care about the switchover for less than a week before it happened and, yes, to take the time to fix that bug too :)
Cheers,
Giuseppe P.S. I'm sure I forgot someone / something amazing we've done; I apologize in advance.
Sorry for the copy/paste fail, I meant
So I want to congratulate everyone who was involved in the process, that includes most of the people on the core platform, performance, search and SRE teams, but a special personal thanks goes to Alexandros and Riccardo for driving most of the process and allowing me to care about the switchover for less than a week before it happened and, yes, to take the time to fix that bug too :)
Cheers,
Giuseppe
This is great!
Thank you to everyone involved, for the really important work that you are all doing, and thanks to Alexandros, Timo & Giuseppe for sharing the highlights. It's great to know that so many pieces can come together in just 8 minutes. This really is an impressive (and important!) accomplishment. You've set the bar so high that it'll be a real challenge* to do it any better next year!
* A challenge which I have no doubt will lead to many more improvements to the infrastructure between now and the next DC-switchover.
Giuseppe Lavagetto
Principal Site Reliability Engineer, Wikimedia Foundation