Hello all,
Today we successfully migrated our wikis (MediaWiki and associated services) from our primary data center (eqiad) to our secondary (codfw), an exercise we've now done for the 3rd year in a row. During the most critical part of the switch, the wikis were in read-only mode for seven and a half minutes, a significant improvement over last year.
Although the switchover process itself has been largely automated and went pretty smoothly once started, we did experience some issues leading up to our maintenance window, which caused us to delay the switch somewhat:
- In the days before the switch, a performance issue was discovered in the Translate extension for CentralNotice, which was expected to cause database stampede issues during the switch. We decided to mitigate this by temporarily disabling the extension for the duration of the switchover process. However, it's now understood that this may have caused some unwanted side effects and should be avoided in the future in favor of other methods.
- Right before the switchover commenced, an eqiad Varnish server misbehaved, causing a high spike of failed requests. Thankfully the SRE Traffic team identified and addressed the issue promptly, allowing the switchover to proceed.
- Two codfw s7 database slaves crashed right before the start of our maintenance window. While we investigated the cause and impact, the start of our switchover procedure was pushed back to approximately 30 minutes into the maintenance window.
- The ElasticSearch search cluster traffic did not follow MediaWiki traffic from eqiad to codfw during the switch as expected, but stayed in our primary data center instead. Investigation showed that ElasticSearch had been hardcoded to eqiad in its configuration. This was rectified after the switchover was complete, with a configuration change and a manual switch to codfw.
- After the switchover completed we experienced some repeated database load spikes, primarily on the codfw s1 cluster (serving English Wikipedia). The DBA team performed a series of fine-tuning and other corrective actions.
All wikis are now served from our secondary codfw data center. This is expected to remain the case for the next 4 weeks, after which we will reverse the procedure.
Should you experience any issue that is deemed related to the switchover process, please feel free to file a ticket in Phabricator and tag it with the Datacenter-Switchover-2018 project tag[1]. We will monitor this tag closely and keep any and all issues updated.
We'd like to thank everyone for their hard work in ensuring any (potential) issues were resolved in a timely manner, for automating the process whenever and wherever possible, and for making this datacenter switch a success!
[1] https://phabricator.wikimedia.org/project/profile/3571/