Hi,
Today we switched over most services and traffic caches from the eqiad (Virginia) datacenter to codfw (Texas) as part of improving our reliability. The goal is to have this procedure working and regularly tested in case of an emergency when we actually need it.
We're only aware of one user-facing impact: for a short time, WDQS lag detection was broken, which affected Wikidata bots that check it. This is tracked as https://phabricator.wikimedia.org/T285710.
Users will experience a bit of a latency increase for now as most user traffic will need to talk to both eqiad and codfw datacenters. This will go away tomorrow once MediaWiki is switched over (keep reading).
Also, we were a bit delayed in starting today because of an issue causing appservers to get stuck: https://phabricator.wikimedia.org/T285634.
== Services ==
Started at 14:29 UTC, officially finished at 15:09.
The main issues we ran into were:
* The helm-charts service is unique in that it doesn't have a service IP, which caused the automatic switchover verification to break. We had to manually check the services that come after it in the list and then re-run the cookbook while excluding it. Tracked as https://phabricator.wikimedia.org/T285707.
* The restbase-async service has some special handling. We debated whether to follow it and opted not to special-case it. Figuring out what to do long-term is tracked as https://phabricator.wikimedia.org/T285711.
* The WDQS issue mentioned earlier.
== Traffic ==
Started at 15:43, finished at 15:45.
It took until ~16:25 for eqiad to mostly depool. There's not much else to report; it went very smoothly.
== Tomorrow's MediaWiki switchover ==
Scheduled for 14:00 UTC: https://zonestamp.toolforge.org/1624888854.
It is our goal to minimize the read-only time and make this a non-event from a user perspective.
All of the coordination will take place in the #wikimedia-operations IRC channel on Libera Chat. You're more than welcome to follow along, but if you have questions, please ask them in #wikimedia-tech so it doesn't get disruptive. The procedure that we'll be following is documented at https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki.
I'm planning to do one more "live test" later today and will announce it on IRC when it gets started.
-- Kunal
Just wanted to emphasize that this is a great effort, and a huge step towards improving the current reliability of our services. We should do more of this, more broadly and more exhaustively.
Kudos!
On 06/28 12:33, Kunal Mehta wrote:
> [...]
Hi again,
Today we switched MediaWiki from our eqiad datacenter to codfw. In total there was 1 minute 57 seconds of read-only time, which is basically what we were aiming for.
We really only had one user-facing issue in that tr.wikivoyage.org was inaccessible for a few minutes because of a typo. https://phabricator.wikimedia.org/T260297 tracks making sure it doesn't happen again.
Other than that, there's not much to report; it went pretty smoothly. The rest of the bugs/issues filed as a result of today's switchover are at https://phabricator.wikimedia.org/T281515#7185775. Most are related to improving the automation around the switch itself.
We've noticed that MediaWiki in codfw is slightly faster, most likely because of newer hardware. Now that eqiad isn't serving traffic, we plan on installing new hardware there too: https://phabricator.wikimedia.org/T279309.
Thanks to everyone who participated in today's switchover and for all the efforts and work ahead of time to make today so smooth.
We will be switching back to eqiad sometime in August; more details will come as we get closer.
-- Kunal
wikitech-l@lists.wikimedia.org