How are we doing in our strive for operational excellence? Read on to find out!
Incidents There were 6 incidents in June this year. That's double the median of three per month, over the past two years (Incident graphs https://codepen.io/Krinkle/full/wbYMZK).
2022-06-01 cloudelastic https://wikitech.wikimedia.org/wiki/Incidents/2022-06-01_Lost_index_in_cloudelastic Impact: For 41 days, Cloudelastic was missing search results about files from commons.wikimedia.org.
2022-06-10 overload varnish haproxy https://wikitech.wikimedia.org/wiki/Incidents/2022-06-10_overload_varnish_haproxy Impact: For 3 minutes, wiki traffic was disrupted in multiple regions for cached and logged-in responses.
2022-06-12 appserver latency https://wikitech.wikimedia.org/wiki/Incidents/2022-06-12_appserver_latency Impact: For 30 minutes, wiki backends were intermittently slow or unresponsive, affecting a portion of logged-in requests and uncached page views.
2022-06-16 MariaDB password https://wikitech.wikimedia.org/wiki/Incidents/2022-06-16_MariaDB_password_leak Impact: For 2 hours, a current production database password was publicly known. Other measures ensured that no data could be compromised (e.g. firewalls and selective IP grants).
2022-06-21 asw-a2-codfw power https://wikitech.wikimedia.org/wiki/Incidents/2022-06-21_asw-a2-codfw_accidental_power_cycle Impact: For 11 minutes, one of the Codfw server racks lost network connectivity. Among the affected servers was an LVS host. Another LVS host in Codfw automatically took over its load balancing responsibility for wiki traffic. During the transition, there was a brief increase in latency for regions served by Codfw (Mexico, and parts of US/Canada).
2022-06-30 asw-a4-codfw power https://wikitech.wikimedia.org/wiki/Incidents/2022-06-30_asw-a4-codfw_accidental_power_cycle Impact: For 18 minutes, servers in the A4-codfw rack lost network connectivity. Little to no external impact.
Incident follow-up Recently completed incident follow-up:
Audit database usage of GlobalBlocking extension https://phabricator.wikimedia.org/T307648 Filed by Amir (Ladsgroup) in May following an outage due to db load from GlobalBlocking. Amir reduced the extensions' DB load by 10%, through avoiding checks for edit traffic from WMCS and Toolforge. And he implemented stats for monitoring GlobalBlocking DB queries going forward.
Reduce Lilypond shellouts from VisualEditor https://phabricator.wikimedia.org/T312319 Filed by Reuven (RLazarus) and Kunal (Legoktm) after a shellbox incident. Ed (Esanders) and Sammy (TheresNoTime) improved the Score extension's VisualEditor plugin to increase its debounce duration.
Remember to review and schedule Incident Follow-up work https://phabricator.wikimedia.org/project/view/4758/ in Phabricator! These are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status https://wikitech.wikimedia.org/wiki/Incident_status on Wikitech.
Trends
In June and July (which is almost over), we reported 27 new production errors https://phabricator.wikimedia.org/maniphest/query/WDqlrITVmIoX/#R and 25 production errors https://phabricator.wikimedia.org/maniphest/query/pzOAOpbnF3PX/#R respectively. Of these 52 new issues, 27 were closed in weeks since then, and 25 remain unresolved and will carry over to August.
We also addressed 25 stagnant problems that we carried over from previous months, thus the workboard overall remains at exactly 299 unresolved production errors.
Take a look at the Wikimedia-production-error https://phabricator.wikimedia.org/tag/wikimedia-production-error/ workboard and look for tasks that could use your help.
💡 *Did you know?* To zoom in and find your team's error reports, use the appropriate "Filter" link in the sidebar of the workboard . For the month-over-month numbers, refer to the spreadsheet data https://docs.google.com/spreadsheets/d/e/2PACX-1vTrUCAI10hIroYDU-i5_8s7pony8M71ATXrFRiXXV7t5-tITZYrTRLGch-3iJbmeG41ZMcj1vGfzZ70/pubhtml.
Thanks!
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
🔗 Share or read later via https://phabricator.wikimedia.org/phame/post/view/292/
wikitech-l@lists.wikimedia.org