See also Incident graphs.
Remember to review and schedule Incident Follow-up work in Phabricator. These tasks are preventive measures and tech debt mitigations
written down after an incident is concluded. Read about past incidents
at Incident status on Wikitech.
Recently conducted incident follow-up:
Create a dashboard for Prometheus metrics about health of Prometheus itself.
Improve wording around AbuseFilter messages about throttling functionality.
Exclude restart procedure from automated Elasticsearch provisioning.
I skip breakdowns most months as each breakdown has its flaws.
However, I hear people find them useful, so I'll try to do them from
time to time with my noted caveats. The last breakdown was in the December edition,
which focussed on throughput during a typical month. It is important to
recognise that neither high nor low throughput is per se good or bad.
It's good when issues are detected, reported, and triaged correctly.
It's also good if a team's components are stable and don't produce any
errors. A report may be found to be invalid or a duplicate, which is
sometimes only determined a few weeks later.
The "after six months" breakdown below takes more of that into
consideration by looking at what's on the table after six months (tasks
up to Sept 2021). This may be considered "fairer" in some sense, although
it has the drawback of suffering from hindsight bias, and possibly not
highlighting current or most urgent areas.
WMF Product:
WMF Tech:
WMDE:
Other:
In February, we reported 25 new production errors.
Of those, 13 have since been resolved, and 12 remain open as of today
(two weeks into the following month). We also resolved 22 errors that
remained open from previous months. The overall workboard has grown
slightly to a total of 301 outstanding error reports.
For the month-over-month graph, refer to the spreadsheet.
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.
Until next time,
– Timo Tijhof