Good start in working this one out.
On 27 October 2015 at 11:32, Greg Grossmeier <greg(a)wikimedia.org> wrote:
Hi all,
Thanks for the discussion. As you can imagine, it is high on my list to
figure out why we've had 2 outages in the past couple weeks caused by
config changes like this (more accurately, what we can do to prevent
it).
I think, after reading Brad's, Oliver's, and Erik's (partial, early
release due to train) responses, most of Risker's questions are answered.
I'll just give a bit more from my perspective.
<quote name="Brad Jorsch (Anomie)" date="2015-10-27"
time="10:56:28 -0400">
On Tue, Oct 27, 2015 at 10:29 AM, Risker <risker.wp(a)gmail.com> wrote:
- Why wasn't it part of the deployment train?
Good question, and one that needs someone involved in this backport to
answer.
Erik and Oliver answered this.
Actually, their answer is "there was a major change to the infrastructure
that broke our ability to collect data, thus escalating the need for this
fix to the point that it needed to be done as a SWAT".
So, as we keep drilling down, we see that the cause isn't even this "urgent
fix", it's the previous deploy from some weeks ago that made the "urgent
fix" necessary. And then we need to ask the same questions about that
deploy - was the full range of tests done, are they the right tests, how
long did it take to notice the problem, was any consideration given to
rolling back that problematic deploy, and so on.
And again these are systems issues. I agree that MediaWiki is a complex
system. But one would think, given the significant focus on data
collection, any acceptance testing before a significant infrastructure
change would include a review to ensure that it will not impact data
collection. (This is the kind of thing I mean when I say "are we doing the
right tests".) We know that we've repeatedly seen implementation of
publicly visible/publicly editable content creation extensions that don't
work with Checkuser, aren't logged like content, aren't deletable or
suppressible, can't be edited by others and so on - all of which should be
standard requirements that should be tested for whenever such an extension
is deployed to a production wiki in the Wikimedia cluster (and I'm pretty
sure other MediaWiki users would consider those to be standard expectations
too). But these aren't tested for because they aren't part of the standard
testing suite.
So yes...you're heading in the right direction, but don't think that by
explaining why the urgent patch didn't work as expected, you've explained
how we got to where we are.
Risker/Anne