Thanks as always for this report, Timo.

One reason the count is higher in May is that this is when the Growth team began implementing a chores process (credit to Readers Web for the inspiration) to systematically review and log production errors that appear on our team dashboard in Logstash. (We've also implemented a triage process for our inbox, which used to have ~2000 tasks and is now at 10.) Some of the tasks we've filed from Logstash are probably duplicates or close relatives of existing production error tasks; because we timebox our triage process, we don't always manage to identify existing tasks before filing new ones.

A bigger problem is how to handle our growing pile of tasks that need attention. As a team tasked with feature development, we find it a challenge to make time for maintenance tasks unrelated to the code we touch day-to-day. So, while we are going to be more diligent about filing tasks when we see issues in Logstash, a task is probably going to stay open unless something appears to be badly broken.

Kosta

On Mon, Jun 21, 2021 at 4:55 AM Krinkle <krinklemail@gmail.com> wrote:

How’d we do in our striving for operational excellence last month? Read on to find out!

Read on Phabricator at https://phabricator.wikimedia.org/phame/post/view/236/

Incidents

Zero incidents recorded in the past month. Yay! That's only five months after November 2020, the last month without documented incidents (Incident stats).

Remember to review Preventive measures in Phabricator, which are action items filed after an incident.

-------

Trends

In May, we unfortunately saw a repeat of the worrying pattern we saw in April, but with higher numbers. We found 54 new errors. This is the most new errors in a single month since the Excellence monthly began three years ago in 2018. About half of these (29 of 54) remain unresolved as of writing, two weeks into the following month.

Figure 1, Figure 2: Unresolved error reports stacked by month.

Month-over-month plots based on spreadsheet data.

-------

New errors in May

Below is a snapshot of just the 54 new issues found last month, listed by their code steward.

Be mindful that the reporting of errors is not itself a negative point per se. I think it should be celebrated when teams have good telemetry, detect their issues early, and address them within their development cycle. It might be more worrisome when teams lack telemetry or time to find such issues, or can't keep up with the pace at which issues are found.

Anti Harassment Tools: None.
Community Tech: None.
Editing Team (+2, -1): Cite (T283755); OOUI (T282176).
Growth Team (+17, -4): Add-Link (T281960); GrowthExperiments (T281525 T281703 T283546 T283638 T283924); Echo (T282446); Recent-changes (T282047 T282726); StructuredDiscussions (T281521 T281523 T281782 T281784 T282069 T282146 T282599 T282605).
Language Team (+1): Translate extension (T283828).
Parsing Team (+1): Parsoid (T281932).
Reading Web: None.
Structured Data: None.
Product Infra Team (+1): WikimediaEvents (T282580).
Analytics: None.
Performance Team: None.
Platform Engineering (+16, -11): MediaWiki-API (T282122); MediaWiki-General (T282173); MediaWiki-Page-derived-data (T281714 T281802 T282180 T283282); MediaWiki-Revision-backend (T282145 T282723 T282825 T283170); MediaWiki-User-management (T283167); MW Expedition (T281526 T281981 T282038 T282181 T283196).
Search Platform (+3, -2): CirrusSearch (T282036 T282207); GeoData (T282735).
WMDE TechWish (+2, -1): Revision-Slider (T282067); VisualEditor Template dialog (T283511).
WMDE Wikidata (+3, -1): Wikibase (T282534 T283198 T283862).
No owner (+7, -6): CentralAuth (T282834 T283635); Change-tagging (T283098 T283099); MapSources (T282833); MediaWiki-Page-information (T283751); Other (T283252).
-------

Outstanding errors

Take a look at the workboard and look for tasks that could use your help.
→  https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Summary over recent months:

Aug 2019 (0 of 14 left): ✅ Last task resolved! (-1)
Jan 2020 (1 of 7 left): ⚠️ Unchanged (over one year old).
Mar 2020 (2 of 2 left): ⚠️ Unchanged (over one year old).
Apr 2020 (4 of 14 left): ⬇️ One task resolved. (-1)
May 2020 (5 of 14 left): ⚠️ Unchanged (over one year old).
Jun 2020 (5 of 14 left): ⚠️ Unchanged (over one year old).
Jul 2020 (4 of 24 issues): ⏸ No change.
Aug 2020 (12 of 53 issues): ⬇️ One task resolved. (-1)
Sep 2020 (7 of 33 issues): ⏸ No change.
Oct 2020 (19 of 69 issues): ⬇️ One task resolved. (-1)
Nov 2020 (8 of 38 issues): ⬇️ One task resolved. (-1)
Dec 2020 (7 of 33 issues): ⏸ No change.
Jan 2021 (3 of 50 issues): ⏸ No change.
Feb 2021 (7 of 20 issues): ⬇️ One task resolved. (-1)
Mar 2021 (14 of 48 issues): ⬇️ Four tasks resolved. (-4)
Apr 2021 (23 of 42 issues): ⬇️ Two tasks resolved. (-2)
May 2021 (29 of 54 issues): 54 new issues found, of which 29 remain open. (+54; -25)

-------

Tally
133 issues open, as of Excellence #31 (12 May 2021).
-12 issues closed, of the previous 133 open issues.
+29 new issues that survived May 2021.
150 issues open, as of today (12 June 2021).
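
The tally can be cross-checked against the per-month summary: the open counts per month should add up to the total, and the per-month closures should add up to the number closed. A minimal sketch, assuming the numbers are copied by hand from this report; the variable names are illustrative and not part of any existing tooling.

```python
# Open issues remaining per month, copied from "Summary over recent months".
open_by_month = {
    "Aug 2019": 0, "Jan 2020": 1, "Mar 2020": 2, "Apr 2020": 4,
    "May 2020": 5, "Jun 2020": 5, "Jul 2020": 4, "Aug 2020": 12,
    "Sep 2020": 7, "Oct 2020": 19, "Nov 2020": 8, "Dec 2020": 7,
    "Jan 2021": 3, "Feb 2021": 7, "Mar 2021": 14, "Apr 2021": 23,
    "May 2021": 29,
}

# Tasks from earlier months that were resolved since the previous report.
closed_by_month = {
    "Aug 2019": 1, "Apr 2020": 1, "Aug 2020": 1, "Oct 2020": 1,
    "Nov 2020": 1, "Feb 2021": 1, "Mar 2021": 4, "Apr 2021": 2,
}

previous_open = 133      # open as of Excellence #31
new_in_may = 54          # new issues found in May 2021
resolved_in_may = 25     # of those, already resolved

closed_old = sum(closed_by_month.values())
surviving_new = new_in_may - resolved_in_may
open_today = previous_open - closed_old + surviving_new

print(f"closed: {closed_old}, surviving new: {surviving_new}, open today: {open_today}")
print("matches per-month summary:", open_today == sum(open_by_month.values()))
```

Both routes arrive at the same total, so the per-month list and the tally are consistent with each other.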

-------

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/