How’d we do in our pursuit of operational excellence last month? Read on to find out!
Incidents
Last month we experienced 2 (public) incidents. This is below the three-year median of 3
incidents a month (Incident graphs <https://codepen.io/Krinkle/full/wbYMZK>).
2022-04-06 esams network
<https://wikitech.wikimedia.org/wiki/Incidents/2022-04-06_esams_network>
Impact: For 30 minutes, wikis were slow or unreachable for a portion of clients routed to the
Esams data center. Esams is one of two DCs primarily serving Europe, Middle East, and
Africa.
2022-04-26 cr2-eqord down
<https://wikitech.wikimedia.org/wiki/Incidents/2022-04-26_cr2-eqord_down>
Impact: No external impact. Internally, for 2 hours we were unable to access our Eqord
routers by any means. This was due to a fiber cut on a redundant link to Eqiad, which then
coincided with planned vendor maintenance on the links to Ulsfo and Eqiad. See also
Network design <https://wikitech.wikimedia.org/wiki/Network_design>.
Incident follow-up
Remember to review and schedule Incident Follow-up work
<https://phabricator.wikimedia.org/project/view/4758/> in Phabricator. These tasks are
preventive measures and tech debt mitigations written down after an incident is concluded.
Read more about past incidents at Incident status
<https://wikitech.wikimedia.org/wiki/Incident_status> on Wikitech.
Recently resolved incident follow-up:
Reduce mysql grants for wikiadmin scripts
<https://phabricator.wikimedia.org/T249683>
Filed in 2020 after the Wikidata drop-table incident (details
<https://wikitech.wikimedia.org/wiki/Incidents/2020-04-07_Wikidata%27s_wb_items_per_site_table_dropped>).
Carried out over the last six months by Ladsgroup (SRE Data Persistence).
Improve reliability of Toolforge k8s cron jobs
<https://phabricator.wikimedia.org/T308204> and Re-enable CronJobControllerV2
<https://phabricator.wikimedia.org/T308205>
Filed earlier this week after a Toolforge incident and carried out by Majavah.
Trends
During the month of April we reported 27 new production errors
<https://phabricator.wikimedia.org/maniphest/query/OZ99DkeJf85D/#R>. Of these new
errors, we resolved 14, and the remaining 13 are still open and have carried over to May.
Last month, the workboard totalled 298 unresolved error reports. Of these older reports
that carried over from previous months, 16 were resolved. Most of these were reports from
before 2019.
The new total, including some tasks for the current month of May, is 292. A slight
decrease! (spreadsheet
<https://docs.google.com/spreadsheets/d/e/2PACX-1vTrUCAI10hIroYDU-i5_8s7pony8M71ATXrFRiXXV7t5-tITZYrTRLGch-3iJbmeG41ZMcj1vGfzZ70/pubhtml>).
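As a rough check of the arithmetic above, the quoted figures can be tallied in a short Python sketch. The variable names are illustrative and do not come from the workboard. The naive carry-over sum need not match the reported total, since that total also includes tasks filed in May and other changes this post does not itemize.

```python
# Figures quoted in this post (April report).
new_errors = 27        # new production errors reported in April
new_resolved = 14      # of those, resolved within the month
previous_total = 298   # unresolved reports on the workboard last month
older_resolved = 16    # older reports resolved during April
reported_total = 292   # new workboard total stated in the post

# New errors still open and carried over to May.
new_open = new_errors - new_resolved

# Older reports still open after this month's resolutions.
older_open = previous_total - older_resolved

# Naive carry-over sum; not expected to equal reported_total exactly.
carried_over = older_open + new_open

# Net month-over-month change in the workboard total.
net_change = reported_total - previous_total

print(f"new still open: {new_open}")
print(f"older still open: {older_open}")
print(f"naive carry-over: {carried_over}")
print(f"net change: {net_change}")
```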
Take a look at the workboard and look for tasks that could use your help.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Thanks!
Thank you to everyone who helped by reporting, investigating, or resolving problems in
Wikimedia production.
Until next time,
– Timo Tijhof
🔗 Share or read later via
https://phabricator.wikimedia.org/phame/post/view/284/