How’d we do in our pursuit of operational excellence last month? Read on to find out!
Incidents
3 documented incidents last month.
2022-02-01 ulsfo network
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2022-02-01_ulsfo_network>
Impact: For 3 minutes, clients served by the ulsfo POP were not able to contribute or
display un-cached pages.
2022-02-22 wdqs updater codfw
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2022-02-22_wdqs_updater_codfw>
Impact: For 2 hours, WDQS updates failed to be processed. Most bots and tools were unable
to edit Wikidata during this time.
2022-02-22 vrts
<https://wikitech.wikimedia.org/wiki/Incident_documentation/2022-02-22_vrts>
Impact: For 12 hours, incoming emails to a specific recently created VRTS queue were not
processed, and senders received a bounce with an SMTP 550 error.
See also Incident graphs <https://codepen.io/Krinkle/full/wbYMZK>.
Incident follow-up
Remember to review and schedule Incident Follow-up work
<https://phabricator.wikimedia.org/project/view/4758/> in Phabricator. These tasks are
preventive measures and tech debt mitigations written down after an incident is concluded.
Read about past incidents at Incident status
<https://wikitech.wikimedia.org/wiki/Incident_status> on Wikitech.
Recently conducted incident follow-up:
Create a dashboard for Prometheus metrics about health of Prometheus itself.
<https://phabricator.wikimedia.org/T222102>
Pitched by CDanis after an April 2019 incident, carried by Filippo (@fgiunchedi).
Improve wording around AbuseFilter messages about throttling functionality.
<https://phabricator.wikimedia.org/T200036>
Originally filed in 2018. This came up last month during an incident where the wording
may've led to a misunderstanding. Now resolved by @Daimona.
Exclude restart procedure from automated Elasticsearch provisioning.
<https://phabricator.wikimedia.org/T290902>
There can be too much automation. Filed after an incident last September. Fixed by
@RKemper.
Outstanding errors
Take a look at the workboard and look for tasks that could use your help.
→
https://phabricator.wikimedia.org/tag/wikimedia-production-error/
I skip breakdowns most months as each breakdown has its flaws. However, I hear people find
them useful, so I'll try to do them from time to time with my noted caveats. The last
breakdown was in the December edition
<https://phabricator.wikimedia.org/phame/post/view/265/production_excellence_39_december_2021/>,
which focussed on throughput during a typical month. It is important to recognise that
neither high nor low throughput is per se good or bad. It's good when issues are
detected, reported, and triaged correctly. It's also good if a team's components
are stable and don't produce any errors. A report may be found to be invalid or a
duplicate, which is sometimes only determined a few weeks later.
The below "after six months" breakdown takes more of that into consideration by
looking at what's on the table after six months (tasks up to Sept 2021). This may be
considered "fairer" in some sense, although it has the drawback of suffering from
hindsight bias, and possibly not highlighting current or most urgent areas.
WMF Product:
* Anti Harassment Tools (3): 1 MW Blocks, 2 SecurePoll.
* Community Tech (0).
* Design Systems (1): 1 WVUI.
* Editing Team (15): 14 VisualEditor, 1 OOUI.
* Growth Team (13): 11 Flow, 1 GrowthExperiments, 1 MW Recent changes.
* Language Team (6): 4 ContentTranslation, 1 CX-server, 1 Translate extension.
* Parsoid Team (9): 8 Parsoid, 1 ParserFunctions extension.
* Product Infrastructure (4): 2 JsonConfig, 1 Kartographer, 1 WikimediaEvents.
* Reading Web (0).
* Structured Data (4): 2 MW Uploading, 1 WikibaseMediaInfo, 1 3D extension.
WMF Tech:
* Data Engineering: 1 EventLogging.
* Fundraising Tech: 1 CentralNotice.
* Performance: 1 Rdbms.
* Platform MediaWiki Team (19): 4 MW-Page-data, 1 MW-REST-API, 1 MW-Action-API, 1
MW-Snapshots, 1 MW-ContentHandler, 1 MW-JobQueue, 1 MW-libs-RequestTimeout, 9 Other.
* Search Platform: 1 MW-Search.
* SRE Service Operations: 1 Other.
WMDE:
* WMDE-Wikidata (7): 5 Wikibase, 2 Lexeme.
* WMDE-TechWish: 1 FileImporter.
Other:
* Missing steward (7): 2 Graph, 2 LiquidThreads, 2 TimedMediaHandler, 1 MW Contributions
page.
* Individually maintained (2): 1 WikimediaIncubator, 1 Score extension.
Trends
In February, we reported 25 new production errors
<https://phabricator.wikimedia.org/maniphest/query/1B79KZ8KkRj6/#R>. Of those, 13
have since been resolved, and 12 remain open as of today (two weeks into the following
month). We also resolved 22 errors that remained open from previous months. The overall
workboard has grown slightly to a total of 301 outstanding error reports.
For the month-over-month graph, refer to the spreadsheet.
<https://docs.google.com/spreadsheets/d/e/2PACX-1vTrUCAI10hIroYDU-i5_8s7pony8M71ATXrFRiXXV7t5-tITZYrTRLGch-3iJbmeG41ZMcj1vGfzZ70/pubhtml>
Thanks!
Thank you to everyone who helped by reporting, investigating, or resolving problems in
Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
🔗 Share or read online via
https://phabricator.wikimedia.org/phame/post/view/267/