Hi,
Today we switched over most services and traffic caches from the eqiad (Virginia) datacenter to codfw (Texas) as part of improving our reliability. The goal is to have this procedure working and regularly tested in case of an emergency when we actually need it.
We're only aware of one user-facing impact: for a short time, WDQS lag detection was broken, affecting Wikidata bots that check it. This is tracked as https://phabricator.wikimedia.org/T285710.
Users will experience a bit of a latency increase for now as most user traffic will need to talk to both eqiad and codfw datacenters. This will go away tomorrow once MediaWiki is switched over (keep reading).
Also, we were a bit delayed in starting today because of an issue causing appservers to get stuck: https://phabricator.wikimedia.org/T285634.
== Services ==
Started at 14:29 UTC, officially finished at 15:09.
The main issues we ran into were:
* the helm-charts service is unique and doesn't have a service IP, causing the automatic switchover verification to break. This required us to manually check the other services that come after it in the list, and then re-run the cookbook while excluding it. Tracked as https://phabricator.wikimedia.org/T285707.
* the restbase-async service has some special handling. We debated whether to follow it and opted not to special-case it. Figuring out what to do long-term is tracked as https://phabricator.wikimedia.org/T285711.
* the WDQS issue mentioned earlier.
== Traffic ==
Started at 15:43, finished at 15:45.
It took until ~16:25 for eqiad to mostly depool. There's not much else to report; it went very smoothly.
== Tomorrow's MediaWiki switchover ==
Scheduled for 14:00 UTC: https://zonestamp.toolforge.org/1624888854.
It is our goal to minimize the read-only time and make this a non-event from a user perspective.
All of the coordination will take place in the #wikimedia-operations IRC channel on Libera Chat. You're more than welcome to follow along, but if you have questions, please ask them in #wikimedia-tech so it doesn't get disruptive. The procedure that we'll be following is documented at https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki.
I'm planning to do one more "live test" later today, and will announce it on IRC when it gets started.
-- Kunal