Hello all,
Today we've successfully migrated our wikis (MediaWiki and associated services)
from our primary data center (eqiad) to our secondary (codfw), an exercise
we've done for the 3rd year in a row. During the most critical part of the
switch today, the wikis were in read-only mode for seven and a half minutes,
a significant improvement over last year.
Although the switchover process itself has been largely automated and went
pretty smoothly once started, we did experience some issues leading up to
our maintenance window, which caused us to delay the switch somewhat:
- In the days before the switch, a performance issue was discovered in the
Translate extension for CentralNotice, which was expected to cause database
stampede issues during the switch. We decided to mitigate this by temporarily
disabling the extension for the duration of the switchover process. However,
it's now understood that this may have caused some unwanted side effects and
should be avoided in the future in favor of other methods.
- Right before the switchover commenced, an eqiad Varnish server
misbehaved, causing a high spike of failed requests. Thankfully the SRE
Traffic team identified and addressed the issue promptly, allowing the
switchover to proceed.
- Two codfw s7 database slaves crashed right before the start of our
maintenance window. This delayed the start of our switchover procedure by
approximately 30 minutes while we investigated the cause and impact.
- The ElasticSearch search cluster traffic did not follow MediaWiki traffic
from eqiad to codfw during the switch as was expected, but stayed in our
primary data center instead. Investigation showed that ElasticSearch had
been manually hardcoded to eqiad in its configuration. This was rectified
after the switchover was complete with a configuration change and manual
switch to codfw.
- After the switchover completed we experienced some repetitive database
load spikes, primarily on the codfw s1 cluster (serving English Wikipedia).
The DBA team performed a series of fine-tuning and other corrective actions.
All wikis are now served from our secondary codfw data center. This is
expected to remain the case for the next 4 weeks, after which we will reverse
this procedure.
Should you experience any issue that is deemed related to the switchover
process, please feel free to file a ticket in Phabricator and tag it with
the Datacenter-Switchover-2018 project tag[1]. We will monitor this tag
closely and keep any and all issues updated.
We'd like to thank everyone for their hard work in ensuring any (potential)
issues were resolved in a timely manner, for automating the process whenever
and wherever possible, and for making this datacenter switch a success!
[1] https://phabricator.wikimedia.org/project/profile/3571/
--
Alexandros Kosiaris <akosiaris(a)wikimedia.org>