Production Excellence (October 2019) - Wikitech-l

7 Nov 2019

📘 Read on Phabricator at
https://phabricator.wikimedia.org/phame/post/view/178/
-------

How’d we do in our strive for operational excellence last month? Read on to
find out!

## 📊 Month in numbers

* 3 documented incidents. [1]
* 33 new Wikimedia-prod-error reports. [2]
* 30 Wikimedia-prod-error reports closed. [3]
* 207 currently open Wikimedia-prod-error reports in total. [4]

There were three recorded incidents last month, which is slightly below our
median of the past two years. Explore this data: <
https://codepen.io/Krinkle/full/wbYMZK>gt;. To read more about these
incidents, their investigations, and pending actionables; check
https://wikitech.wikimedia.org/wiki/Incident_documentation#2019

-------

## *️⃣ To Log or not To Log

MediaWiki uses the PSR-3 compliant Monolog library to send messages to
Logstash (via rsyslog and Kafka [5]). These messages are used to
automatically detect (by quantity) when the production cluster is in an
unstable state. For example, due to an increase in application errors when
deploying code, or if a backend system is failing. Two distinct issues
hampered the storing of these messages this month, and both affected us
simultaneously.

*Elasticsearch mapping limit*
The Elasticsearch storage behind Logstash optimises responses to Logstash
queries with an index. This index has an upper limit to how many distinct
fields (or columns) it can have. When reached, messages with fields not yet
in the index are discarded. Our Logstash indexes are sharded by date and
source (one for “mediawiki”, one for “syslog”, and one for everthing else).

This meant that error messages were only stored if they only contained
fields used before, by other errors stored that day. Which in turn would
only succeed if that day’s columns weren’t already fully taken. A seemingly
random subset of error messages was then rejected for a full day. Each day
it got a new chance at reserving its columns, so long as the specific kind
of error is triggered early enough.

To unblock deployment automation and monitoring of MediaWiki, an interim
solution was devised. The subset of messages from “mediawiki” that deal
with application errors now have their own index shard. These error reports
follow a consistent structure, and contain no free-form context fields. As
such, this index (hopefully) can’t reach its mapping limit or suffer
message loss.

The general index mapping limit was also raised from 1000 to 2000. For now
that means we’re not dropping any non-critical/debug messages. More
information about the incident at https://phabricator.wikimedia.org/T234564.
The general issue with accommodating debug messages in Logstash long-term,
is tracked at https://phabricator.wikimedia.org/T180051. Thanks Bartosz,
and Keith Herron.

*Crash handling*
Wikimedia’s PHP configuration has a “crash handler” that kicks in if
everything else fails. For example, when the memory limit or execution
timeout is reached, or if some crucial part of MediaWiki fails very early
on. In that case our crash handler renders a Wikimedia-branded system error
page (separate from MediaWiki and its skins). It also increments a counter
metric for monitoring purposes, and sends a detailed report to Logstash. In
migrating the crash handler from HHVM to PHP7, one part of the puzzle was
forgotten. Namely the Logstash configuration that forwards these reports
from php-fpm’s syslog channel to the one for mediawiki.

As such, our deployment automation and several Logstash dashboards were
blind to a subset of potential fatal errors for a few days. Regressions
during that week were instead found by manually digging through the raw
feed of the php-fpm channel instead. As a temporary measure, Scap was
updated to consider the php-fpm’s channel as well in its automation that
decides whether a deployment is “green”.

We’ve created new Logstash configurations that forward PHP7 crashes in a
similar way as we did for HHVM in the past. Bookmarked MW
dashboards/queries you have for Logstash now provide a complete picture
once again. Thanks Effie, and Cole White! –
https://phabricator.wikimedia.org/T234283

-------

## 📉  Outstanding reports

Take a look at the workboard and look for tasks that might need your help.
The workboard lists error reports, grouped by the month in which they were
first observed.

→  https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Or help someone that’s already started with their patch:
→  https://phabricator.wikimedia.org/maniphest/query/CFzrDj3vFbE_/#R

Breakdown of recent months (past two weeks not included):

* March: 1 report fixed. (3 of 10 reports left).
* April: 8 of 14 reports left (unchanged). ⚠️
* May: (All clear!)
* June: 9 of 11 reports left (unchanged). ⚠️
* July: 13 of 18 reports left (unchanged).
* August: 2 reports were fixed! (6 of 14 reports left).
* September: 2 reports were fixed! (10 of 12 new reports left).
* October: 12 new reports survived the month of October.

-------

##### 🎉 Thanks!

Thank you, to everyone else who helped by reporting, investigating, or
resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

🌴“Gotta love crab. In time, too. I couldn't take much more of
those coconuts. Coconut milk is a natural laxative. That's something
Gilligan never told us.”

-------

Footnotes:

[1] Incidents. –
https://wikitech.wikimedia.org/wiki/Special:PrefixIndex?prefix=Incident+doc…

[2] Tasks created. –
https://phabricator.wikimedia.org/maniphest/query/sJI9Af6LqvKL/#R

[3] Tasks closed. –
https://phabricator.wikimedia.org/maniphest/query/scW28HMEJemU/#R

[4] Open tasks. –
https://phabricator.wikimedia.org/maniphest/query/47MGY8BUDvRD/#R

[5] https://wikitech.wikimedia.org/wiki/Logstash/Interface