tl;dr: There's a large backlog of production errors. Release Engineering is
blocking the train for any new logspam. Your help is needed!
A quick update on the deployment train:
In the process of rolling out wmf/1.36.0-wmf.28 there were a number of
issues that prevented us from rolling forward the train in a timely manner.
After the issues were resolved and backports deployed to the current
version in production (wmf/1.36.0-wmf.27), we realized there were a few
remaining spammy log messages and blocked the following week's train on them.
Release Engineering has long blocked the train on logspam issues. Even
when it does not indicate user-facing errors, logspam of any kind makes it
harder for us to see real problems. We have, however, defaulted to pushing
forward the train despite minor issues.
Under this custom, many log messages have been accepted as "just
occasional, not a big deal" or "yeah, we'll fix that eventually...
not a big deal". Frequently, "eventually" never arrives. This results in
an unmanageable accumulation of exceptions (see the ever-growing list of
exceptions in the Wikimedia-production-error workboard and logstash).
To deal with these issues we are now, as a matter of policy, blocking
trains that cause any new error messages. In most cases new errors are the
result of code changes that lack defensive coding practices and/or have
unexpected interactions with other code. The best resolution in these cases
is for the code to be fixed or reverted.
Release Engineering organises a weekly "train log triage" meeting, on
Wednesdays at 19:00 UTC, where we invite people who develop MediaWiki to
help triage log messages. As of this week, there is also a second meeting, on
Thursdays at 10:00 UTC, which better suits people in European time zones. We
invite everyone who develops MediaWiki or its extensions to join one of the
meetings each week.
| Greg Grossmeier                  GPG: B2FA 27B1 F7EB D327 6B8E |
| Dir. Engineering Productivity         A18D 1138 8E47 FAC8 1C7D |