Hello everyone,
Today we've concluded the successful migration of our wikis (MediaWiki and associated services) from our secondary datacenter (codfw) back to the primary one (eqiad). During the most critical part of the switch today, the wikis were in read-only mode for a duration of 4 minutes and 41 seconds. That's a significant improvement over the 7 minutes and 34 seconds we achieved during the inverse process we concluded a month ago, which was already significantly better than last year. I'd like to believe that it's the result of the growing experience we are building and the trust we are placing in the process and tools that we have developed for this.
Although the switchback process itself has been largely automated and went pretty smoothly, we did experience some issues:
- CentralNotice banners stayed online longer than necessary due to a miscommunication. This has now been documented and will be avoided in the future.
- After the switchback we've experienced increased load on all our MediaWiki application servers. The root cause has been identified and a mitigation will be put in place. In summary, replication of parsercache between the two datacenters was not working.
- Last but not least, and probably the most important of all the issues, a data inconsistency was detected in Wikidata (s8). Namely, some articles that were present in codfw had not been replicated to eqiad. We are still investigating the root cause of this while applying corrective actions to mitigate the user impact as quickly as possible.
All wikis are now served from our primary data center again.
Should you experience any issue that is deemed related to the switchover process, please feel free to file a ticket in Phabricator and tag it with the Datacenter-Switchover-2018 project tag[1]. We will monitor this tag closely and keep any and all issues updated.
We'd like to thank everyone for their hard work in ensuring any (potential) issues were resolved in a timely manner, for automating the process whenever and wherever possible, and for making this datacenter switchover and switchback a success!
A minor correction:
> During the most critical part of the switch today, the wikis were in read-only mode for a duration of 4 minutes and 41 seconds.
This was yesterday, not today.
wikitech-l@lists.wikimedia.org