Hi all!
tl;dr: There's a large backlog of production errors. Release Engineering is blocking the train for any new logspam. Your help is needed!
A quick update on the deployment train:
In the process of rolling out wmf/1.36.0-wmf.28 there were a number of
issues that prevented us from rolling forward the train in a timely
manner. After the issues were resolved and backports deployed to the
current version in production (wmf/1.36.0-wmf.27), we realized there
were a few remaining spammy log messages and blocked the following
week's train on those issues.
Release Engineering has long blocked the train on logspam issues[0]. Even when
it does not indicate user-facing errors, logspam of any kind makes it
harder for us to see real problems. We have, however, defaulted to
pushing forward the train despite minor issues.
Under this custom, many log messages have been accepted as "just occasional,
not a big deal" or "yeah, we'll fix that eventually... it's not a big
deal". Frequently, "eventually" never arrives. This results in an
unmanageable accumulation of exceptions (see the ever-growing list of
exceptions in the Wikimedia-production-error workboard[1] and
logstash[2]).
To deal with these issues we are now, as a matter of policy, blocking
trains that cause any new error messages. In most cases new errors are
the result of code changes that lack defensive coding practices and/or
have unexpected interactions with other code. The best resolution in
these cases is for the code to be fixed or reverted.
Release Engineering organises a weekly "train log triage" meeting, on
Wednesdays at 19:00 UTC, where we invite people who develop MediaWiki to
help triage log messages. As of this week, there is also a second
meeting, on Thursdays at 10:00 UTC, which better suits people in
European time zones. We invite everyone who develops MediaWiki or its
extensions to join one of the meetings each week.
Thank you,
Greg
--
| Greg Grossmeier              GPG: B2FA 27B1 F7EB D327 6B8E |
| Dir. Engineering Productivity     A18D 1138 8E47 FAC8 1C7D |