📘 Read on Phabricator at https://phabricator.wikimedia.org/phame/post/view/172
-------
How’d we do in our drive for operational excellence in August? Read on to find out!
## 📊 Month in numbers
* 3 documented incidents. [1]
* 42 new Wikimedia-prod-error reports. [2]
* 31 Wikimedia-prod-error reports closed. [3]
* 210 currently open Wikimedia-prod-error reports in total. [4]
The number of recorded incidents in August, at three, was below the average for the year so far. However, in previous years (2017-2018), August also had 2-3 incidents. – Explore the data at https://codepen.io/Krinkle/full/wbYMZK
To read more about these incidents, their investigations, and pending actionables, check https://wikitech.wikimedia.org/wiki/Incident_documentation#2019
## *️⃣ When you have eliminated the impossible...
Reports from Logstash indicated that some user requests were aborted by a fatal PHP error from the MessageCache class. The user would be shown a generic system error page. The affected requests didn’t seem to have anything obvious in common, however, which made the problem difficult to diagnose.
MessageCache is responsible for fetching interface messages, such as the localised word “Edit” on the edit button. It calls a “load()” function and then tries to access the loaded information. However, sometimes the load function would claim to have finished its work, yet the information was not there.
When the load function initialises all the messages for a particular language, it keeps track of this, so that it doesn’t do the same work a second time.
From any angle I could look at this code, no obvious mistakes stood out. A deeper investigation revealed that two unrelated changes (made more than a year apart) each broke one assumption that, on its own, was safe to break. But put the two together, and this seemingly impossible problem emerges. Check out the details of the investigation at https://phabricator.wikimedia.org/T208897#5373846.
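The real changes are described in the task linked above. To illustrate this class of bug only, here is a minimal Python sketch. The `MessageStore` class, its size cap, and `_fetch()` are hypothetical; they are not MediaWiki’s actual MessageCache code. The sketch shows how a “loaded” flag kept separately from the data it describes can report success after the data itself is gone.

```python
from collections import OrderedDict


class MessageStore:
    """Hypothetical message store, NOT MediaWiki's actual MessageCache.

    load() remembers which languages it has loaded, so repeat calls
    can skip the work, mirroring the behaviour described in the post.
    """

    def __init__(self, max_languages=2):
        # Hypothetical change A: cap memory by keeping at most N languages,
        # evicting the oldest-loaded language first.
        self.data = OrderedDict()
        self.max_languages = max_languages
        # Hypothetical change B: bookkeeping kept apart from the data it describes.
        self.loaded = set()

    def _fetch(self, lang):
        # Stand-in for fetching a language's messages from storage.
        return {"edit": f"Edit ({lang})"}

    def load(self, lang):
        if lang in self.loaded:
            return  # Trust the flag and skip the work.
        self.data[lang] = self._fetch(lang)
        if len(self.data) > self.max_languages:
            self.data.popitem(last=False)  # Evict the oldest-loaded language.
        self.loaded.add(lang)

    def get(self, lang, key):
        self.load(lang)
        # Raises KeyError if the data was evicted while the flag stayed set.
        return self.data[lang][key]


if __name__ == "__main__":
    store = MessageStore(max_languages=2)
    for lang in ("en", "de", "fr"):
        print(store.get(lang, "edit"))
    try:
        # "en" was evicted when "fr" arrived, but the flag still says loaded.
        store.get("en", "edit")
    except KeyError:
        print("load() reported success, but the 'en' messages are gone")
```

In this sketch, either choice is reasonable in isolation: a size cap is harmless if a later load() always refills the data, and a separate flag is harmless if nothing else ever removes the data. One possible fix here would be to derive the check from the data itself (`lang in self.data`), so that the two can never disagree.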
## 📉 Outstanding reports
Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Or help someone who’s already started on a patch: → https://phabricator.wikimedia.org/maniphest/query/pzVPXPeMfRIz/#R
Breakdown of recent months (past two weeks not included):
* January: 1 report left (unchanged).
* February: 2 reports left (unchanged). ⚠️
* March: 4 reports left (unchanged). ⚠️
* April: 2 reports got fixed! (8 of 14 reports left).
* May: 4 of 10 reports left (unchanged). ⚠️
* June: 1 report got fixed! (8 of 11 reports left).
* July: 2 reports got fixed (17 of 18 reports left).
* August: 14 new reports remain unsolved.
* September: 11 new reports remain unsolved.
-------
## 🎉 Thanks!
Thank you to Aaron Schulz, Daimona, David Barratt, James Forrester, Kosta Harlan, Piotr Miazga, Roan Kattouw, Tom Arrow, Željko Filipin, and everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
-------
Footnotes:
[1] Incidents. – https://wikitech.wikimedia.org/wiki/Special:PrefixIndex?prefix=Incident+docu...
[2] Tasks created. – https://phabricator.wikimedia.org/maniphest/query/8fpsoBLrmlFu/#R
[3] Tasks closed. – https://phabricator.wikimedia.org/maniphest/query/U9.KRVNW52Yb/#R
[4] Open tasks. – https://phabricator.wikimedia.org/maniphest/query/47MGY8BUDvRD/#R