I just want to say thank you so much for these emails. They're great on their own, but together they paint a clear picture at a level that is usually inaccessible to those of us outside everyday MediaWiki development. Thank you!
On Sat, Dec 11, 2021 at 20:39 Krinkle <krinkle@fastmail.com> wrote:
How’d we do in our striving for operational excellence last month? Read on to find out!

Incidents
6 documented incidents last month. That's above the two-year and five-year median of 4 per month (per Incident graphs https://codepen.io/Krinkle/full/wbYMZK).
2021-11-04 large file upload timeouts https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-04_large_file_upload_timeouts; Impact: For 9 months, editors were unable to upload large files (e.g. to Commons). Editors would receive generic error messages, typically after a timeout. In retrospect, a dozen distinct production errors that had been reported and were regularly observed were related and provided different clues. However, most of these remained untriaged and uninvestigated for months. This may be related to the affected components having no active code steward.
2021-11-05 TOC language converter https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-05_TOC_language_converter; Impact: For 6 hours, wikis experienced a blank or missing table of contents on many pages. For up to 3 days prior, wikis that have multiple language variants (such as Chinese Wikipedia) displayed the table of contents in an incorrect or inconsistent language variant (which is not understandable to some readers).
2021-11-10 cirrussearch commonsfile outage https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-10_cirrussearch_commonsfile_outage; Impact: For ~2.5 hours, the Search results page was unavailable on many wikis (except English Wikipedia). On Wikimedia Commons the search suggestions feature was unresponsive as well.
2021-11-18 codfw ipv6 network https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-18_codfw_ipv6_network; Impact: For 8 minutes, the Codfw cluster experienced partial loss of IPv6 connectivity for upload.wikimedia.org. This did not affect availability of the service because the "Happy Eyeballs https://en.wikipedia.org/wiki/Happy_Eyeballs" algorithm ensures browsers (and other clients) automatically fall back to IPv4. The Codfw cluster generally serves Mexico and parts of the US and Canada. The upload.wikimedia.org service serves photos and other media/document files, such as those displayed in Wikipedia articles.
2021-11-23 core network routing https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-23_Core_Network_Routing; Impact: For about 12 minutes, Eqiad was unable to reach hosts in other data centers via public IP addresses. This was due to a BGP routing error. There was no impact on end-user traffic, and impact on internal traffic was limited (only Icinga alerts themselves) because internal traffic generally uses local IP subnets which we currently route with OSPF instead of BGP.
2021-11-25 eventgate-main outage https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-25_eventgate-main_outage; Impact: For about 3 minutes, eventgate-main was down. This resulted in 25,000 MediaWiki backend errors due to inability to queue new jobs. About 1000 user-facing web requests failed (HTTP 500 Error). Event production briefly dropped from ~3000 per second to 0 per second.

Incident follow-up
Remember to review and schedule Incident Follow-up work https://phabricator.wikimedia.org/project/view/4758/ in Phabricator. These tasks are the preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status https://wikitech.wikimedia.org/wiki/Incident_status on Wikitech.
Recently resolved incident follow-up:
Disable DPL on wikis that aren't using it https://phabricator.wikimedia.org/T287916 Filed after a July 2021 incident, done by Amir (Ladsgroup) and Kunal (Legoktm).
Create easy access to MySQL ports for faster incident response and maintenance https://phabricator.wikimedia.org/T291352 Filed in Sep 2021, and carried out by Stevie (Kormat).
Create paging alert for primary DB hosts https://phabricator.wikimedia.org/T233684 Filed after a Sept 2019 incident, done by Stevie (Kormat).
Trends
November saw 27 new production error reports of which 14 were resolved, and 13 remain open and carry over to the next month.
Of the 301 errors still open from previous months, 16 were resolved. Together with the 13 carried over from November, that brings the workboard to 298 unresolved tasks.

Figure 1: Unresolved error reports by month https://phabricator.wikimedia.org/phame/post/view/261/production_excellence_38_november_2021/#trends.
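
For anyone who wants to double-check the workboard numbers, here is a minimal sketch of the arithmetic in Python. The counts are the ones quoted in the Trends paragraph; the function and variable names are made up for illustration and are not part of any Phabricator tooling.

```python
def unresolved_total(open_before, resolved_old, new_reports, resolved_new):
    """Return (carried_over, total_unresolved) for one month.

    Hypothetical helper: it only restates the arithmetic from the
    Trends paragraph and does not query Phabricator.
    """
    # New reports that were not resolved carry over to the next month.
    carried_over = new_reports - resolved_new
    # Older reports still open, plus this month's carry-over.
    total = (open_before - resolved_old) + carried_over
    return carried_over, total


# November 2021 figures as stated in the email.
carried_over, total = unresolved_total(
    open_before=301, resolved_old=16, new_reports=27, resolved_new=14
)
print(carried_over)  # 13
print(total)         # 298
```

Both results agree with the figures reported above: 13 carried over, 298 unresolved in total.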
Outstanding errors
Take a look at the workboard and look for tasks that could use your help. → https://phabricator.wikimedia.org/tag/wikimedia-production-error/
💡 Did you know: *To find your team's error reports, use the appropriate **"Filter" link in the sidebar of the workboard**.*
Issues carried over from recent months:
Apr 2021: 9 of 42 issues left.
May 2021: 16 of 54 issues left.
Jun 2021: 9 of 26 issues left.
Jul 2021: 11 of 31 issues left.
Aug 2021: 10 of 46 issues left.
Sep 2021: 10 of 24 issues left.
Oct 2021: 20 of 49 issues left.
Nov 2021: 13 of 27 new issues https://phabricator.wikimedia.org/maniphest/query/0W0Nuk9umBDc/#R are carried forward.
Thanks!
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
🔗 Share or read later via https://phabricator.wikimedia.org/phame/post/view/261/
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/