Impact: For 3 minutes, clients served by the ulsfo POP were not able to contribute or display un-cached pages.

Impact: For 2 hours, WDQS updates failed to be processed. Most bots and tools were unable to edit Wikidata during this time.

2022-02-22 vrts

Impact: For 12 hours, incoming emails to a specific recently created VRTS queue were not processed, and senders received a bounce with an SMTP 550 error.

See also Incident graphs.

Incident follow-up

Remember to review and schedule Incident Follow-up work in Phabricator. These tasks are preventive measures and tech debt mitigations written down after an incident is concluded. Read about past incidents at Incident status on Wikitech.

Recently conducted incident follow-up:

Create a dashboard for Prometheus metrics about health of Prometheus itself.

Pitched by CDanis after an April 2019 incident, carried by Filippo (@fgiunchedi).

Improve wording around AbuseFilter messages about throttling functionality.

Originally filed in 2018. This came up last month during an incident where the wording may've led to a misunderstanding. Now resolved by @Daimona.

Exclude restart procedure from automated Elasticsearch provisioning.

I skip breakdowns most months as each breakdown has its flaws. However, I hear people find them useful, so I'll try to do them from time to time with my noted caveats. The last breakdown was in the December edition, which focussed on throughput during a typical month. It is important to recognise that neither high nor low throughput is per se good or bad. It's good when issues are detected, reported, and triaged correctly. It's also good if a team's components are stable and don't produce any errors. A report may be found to be invalid or a duplicate, which is sometimes only determined a few weeks later.

The "after six months" breakdown below takes more of that into consideration by looking at what's on the table after six months (tasks up to Sept 2021). This may be considered "fairer" in some sense, although it has the drawback of suffering from hindsight bias, and possibly not highlighting current or most urgent areas.

WMF Product:

Anti Harassment Tools (3): 1 MW Blocks, 2 SecurePoll.
Community Tech (0).
Design Systems (1): 1 WVUI.
Editing Team (15): 14 VisualEditor, 1 OOUI.
Growth Team (13): 11 Flow, 1 GrowthExperiments, 1 MW Recent changes.
Language Team (6): 4 ContentTranslation, 1 CX-server, 1 Translate extension.
Parsoid Team (9): 8 Parsoid, 1 ParserFunctions extension.
Product Infrastructure: 2 JsonConfig, 1 Kartographer, 1 WikimediaEvents.
Reading Web (0).
Structured Data (4): 2 MW Uploading, 1 WikibaseMediaInfo, 1 3D extension.

WMF Tech:

Data Engineering: 1 EventLogging.
Fundraising Tech: 1 CentralNotice.
Performance: 1 Rdbms.
Platform MediaWiki Team (19): 4 MW-Page-data, 1 MW-REST-API, 1 MW-Action-API, 1 MW-Snapshots, 1 MW-ContentHandler, 1 MW-JobQueue, 1 MW-libs-RequestTimeout, 9 Other.
Search Platform: 1 MW-Search.
SRE Service Operations: 1 Other.

WMDE:

WMDE-Wikidata (7): 5 Wikibase, 2 Lexeme.
WMDE-TechWish: 1 FileImporter.

Other:

Missing steward (7): 2 Graph, 2 LiquidThreads, 2 TimedMediaHandler, 1 MW Contributions page.
Individually maintained (2): 1 WikimediaIncubator, 1 Score extension.

In February, we reported 25 new production errors. Of those, 13 have since been resolved, and 12 remain open as of today (two weeks into the following month). We also resolved 22 errors that remained open from previous months. The overall workboard has grown slightly to a total of 301 outstanding error reports.

For the month-over-month graph, refer to the spreadsheet.

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof