How’d we do in our pursuit of operational excellence last month? Read on to find out!
Incidents

By golly, we've had quite the month! 10 documented incidents, which is more than three times the two-year median of 3. The last time we experienced ten or more incidents in one month was June 2019, when we had eleven (Incident graphs https://codepen.io/Krinkle/full/wbYMZK, Excellence monthly of June 2019 https://phabricator.wikimedia.org/phame/post/view/163/production_excellence_12_june_2019/).
I'd like to draw your attention to something positive. As you read the list below, take note of incidents that did *not* impact public services, and did *not* have lasting impact or data loss. For example, the Apache incident https://wikitech.wikimedia.org/wiki/Incidents/2022-05-24_Failed_Apache_restart benefited from PyBal's automatic health-based depooling. The deployment server incident https://wikitech.wikimedia.org/wiki/Incidents/2022-05-02_deployment recovered without loss thanks to Bacula. The impact of the Etcd incident https://wikitech.wikimedia.org/wiki/Incidents/2022-05-01_etcd was limited by serving stale data. And the Hadoop incident https://wikitech.wikimedia.org/wiki/Incidents/2022-05-31_Analytics_Data_Lake_-_Hadoop_Namenode_failure recovered by resuming from Kafka right where it left off.
2022-05-01 etcd https://wikitech.wikimedia.org/wiki/Incidents/2022-05-01_etcd Impact: For 2 hours, Conftool could not sync Etcd data between our core data centers. Puppet and some other internal services were unavailable or out of sync. The issue was isolated, with no impact on public services.
2022-05-02 deployment server https://wikitech.wikimedia.org/wiki/Incidents/2022-05-02_deployment Impact: For 4 hours, we could not update or deploy MediaWiki and other services, due to corruption on the active deployment server. No impact on public services.
2022-05-05 site outage https://wikitech.wikimedia.org/wiki/Incidents/2022-05-05_Wikimedia_full_site_outage Impact: For 20 minutes, all wikis were unreachable for logged-in users and non-cached pages. This was due to a GlobalBlocks schema change causing significant slowdown in a frequent database query.
2022-05-09 Codfw confctl https://wikitech.wikimedia.org/wiki/Incidents/2022-05-09_confctl Impact: For 5 minutes, all web traffic routed to Codfw received error responses. This affected the central USA and South America (local time after midnight). The cause was human error and a lack of CLI parameter validation.
2022-05-09 exim-bdat-errors https://wikitech.wikimedia.org/wiki/Incidents/2022-05-09_exim-bdat-errors Impact: For five days, about 14,000 incoming emails from Gmail users to wikimedia.org were rejected and returned to sender.
2022-05-21 varnish cache busting https://wikitech.wikimedia.org/wiki/Incidents/2022-05-21_varnish_cache_busting Impact: For 2 minutes, all wikis and services behind our CDN were unavailable to all users.
2022-05-24 failed Apache restart https://wikitech.wikimedia.org/wiki/Incidents/2022-05-24_Failed_Apache_restart Impact: For 35 minutes, numerous internal services that use Apache on the backend were down. This included Kibana (logstash) and Matomo (piwik). For 20 of those minutes, there was also reduced MediaWiki server capacity, but no measurable end-user impact for wiki traffic.
2022-05-25 de.wikipedia.org https://wikitech.wikimedia.org/wiki/Incidents/2022-05-25_de.wikipedia.org Impact: For 6 minutes, a portion of logged-in users and requests for non-cached pages experienced slower responses or errors. This was due to increased load on one of the databases.
2022-05-26 m1 database hardware https://wikitech.wikimedia.org/wiki/Incidents/2022-05-26_Database_hardware_failure Impact: For 12 minutes, internal services hosted on the m1 database (e.g. Etherpad) were unavailable or at reduced capacity.
2022-05-31 Analytics Hadoop failure https://wikitech.wikimedia.org/wiki/Incidents/2022-05-31_Analytics_Data_Lake_-_Hadoop_Namenode_failure Impact: For 1 hour, all HDFS writes and reads were failing. After recovery, ingestion from Kafka resumed and caught up. No data loss or other lasting impact on the Data Lake.
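As an aside, the "resume from Kafka" recovery pattern in that last incident is worth a quick illustration. This is not the Data Lake's actual ingestion code; it's a minimal Python sketch of Kafka consumer-group offsets, with hypothetical topic, group, broker, and sink names throughout:

```
from kafka import KafkaConsumer  # pip install kafka-python

def write_to_hdfs(message):
    """Hypothetical sink; stands in for the real ingestion pipeline."""
    print(message.topic, message.offset)

# Hypothetical names; the real pipeline and brokers differ.
consumer = KafkaConsumer(
    "analytics-events",
    group_id="hdfs-ingestion",
    bootstrap_servers=["kafka1001.example.org:9092"],
    enable_auto_commit=False,  # commit only after a successful write
)

for message in consumer:
    write_to_hdfs(message)
    # The committed offset is stored durably by Kafka. If the consumer
    # (or HDFS) is down for an hour, the group resumes from the last
    # committed offset and catches up, with no data lost or re-read.
    consumer.commit()
```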
Incident follow-up

Recently completed incident follow-up:
Invalid confctl selector should either error out or select nothing https://phabricator.wikimedia.org/T308100 Filed by Amir (@Ladsgroup https://phabricator.wikimedia.org/p/Ladsgroup/) after the confctl incident this past month. Giuseppe (@Joe https://phabricator.wikimedia.org/p/Joe/) implemented CLI parameter validation to prevent human error from causing a similar outage in the future.
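To make the idea of that fix concrete: fail fast on a selector that can't possibly match, instead of silently selecting (and potentially depooling) more than intended. Below is a minimal sketch in Python, assuming hypothetical field names rather than conftool's real schema or code:

```
import re

# Hypothetical field names; conftool's real schema differs.
KNOWN_FIELDS = {"dc", "cluster", "service", "name"}

def parse_selector(selector: str) -> dict:
    """Parse 'field=value,field=value' and fail fast on anything unknown.

    Raising here means a typo selects nothing, rather than silently
    matching (and acting on) every object.
    """
    parsed = {}
    for part in selector.split(","):
        m = re.fullmatch(r"(\w+)=([\w.-]+)", part)
        if m is None:
            raise ValueError(f"Malformed selector fragment: {part!r}")
        field, value = m.groups()
        if field not in KNOWN_FIELDS:
            raise ValueError(f"Unknown selector field: {field!r}")
        parsed[field] = value
    return parsed

# A typo now errors out instead of matching everything:
# parse_selector("nmae=mw1414")  -> ValueError: Unknown selector field: 'nmae'
```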
Backup opensearch dashboards data https://phabricator.wikimedia.org/T237224 Filed back in 2019 by Filippo (@fgiunchedi https://phabricator.wikimedia.org/p/fgiunchedi/). The OpenSearch homepage dashboard (at logstash.wikimedia.org) was accidentally deleted last month. Bryan (@bd808 https://phabricator.wikimedia.org/p/bd808/) tracked down its content and re-created it. Cole (@colewhite https://phabricator.wikimedia.org/p/colewhite/) and Jaime (@jcrespo https://phabricator.wikimedia.org/p/jcrespo/) worked out a strategy and set up automated backups going forward.

Remember to review and schedule Incident Follow-up work https://phabricator.wikimedia.org/project/view/4758/ in Phabricator! These are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status on Wikitech.
💡*Did you know?*: The form on the *Incident status https://wikitech.wikimedia.org/wiki/Incident_status* page now includes a date field, making it easier to create backdated reports.
Trends

In May we discovered 28 new production errors https://phabricator.wikimedia.org/maniphest/query/z7vLwJdXtLu2/#R, of which 20 remain unresolved and have come with us to June.
Last month the workboard totalled 292 tasks still open from prior months. Since the last edition, we completed 11 tasks from previous months, gained 11 additional errors from May (part of May was already counted in last month's edition), and have 7 fresh errors in the current month of June. As of today, the workboard houses 299 open production error tasks (spreadsheet and graph https://docs.google.com/spreadsheets/d/e/2PACX-1vTrUCAI10hIroYDU-i5_8s7pony8M71ATXrFRiXXV7t5-tITZYrTRLGch-3iJbmeG41ZMcj1vGfzZ70/pubhtml, phab report https://phabricator.wikimedia.org/project/reports/1055/).
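For those keeping score, the numbers do add up: 292 − 11 + 11 + 7 = 299.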
Take a look at the workboard and find tasks that could use your help. → https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.
Until next time,
– Timo Tijhof
🔗 Share or read later via https://phabricator.wikimedia.org/phame/post/view/285/