My phone sent the above writing as the train jostled. I'll resend after
finishing at the office.
On Oct 27, 2015 8:04 AM, "Erik Bernhardson" <ebernhardson(a)wikimedia.org>
wrote:
On Oct 27, 2015 7:29 AM, "Risker" <risker.wp(a)gmail.com> wrote:
On 27 October 2015 at 09:57, Brad Jorsch (Anomie) <bjorsch(a)wikimedia.org>
wrote:
> On Tue, Oct 27, 2015 at 8:02 AM, Risker <risker.wp(a)gmail.com> wrote:
>
> > The incident report does not go far enough back into the history of the
> > incident. It does not explain how this code managed to get into the
> > deployment chain with a fatal error in it.
>
> Actually, it does. Erik writes "This occurred because the patch for the
> CirrusSearch repository that removed the schema should have been deployed
> before the change that adds it to the WikimediaEvents repository."
>
> In other words, there was nothing wrong with the code itself. The problem
> was that the multiple pieces of the change needed to be done in a
> particular order during the manual backporting process, but they were not
> done in that order.
>
> If this had waited for the train deployment, both pieces would have been
> done simultaneously and it wouldn't have been an issue, just as it wasn't
> an issue when these changes were done in master and automatically deployed
> to Beta Labs.
That's a start, Brad. But even as someone who has limited experience with
software deployment, I can think of at least half a dozen questions that
I'd be asking here:
- Why wasn't it part of the deployment train

This was a fix for something that broke during the previous deployment
train. Specifically, a hook was changed in core and not noticed in the
extension until the events from JavaScript stopped coming into our logging
tables.
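To illustrate the general failure mode, here is a rough Python sketch of how
a changed hook signature can silently stop events from flowing (the real
code is PHP in MediaWiki core and the extension; all names below are
invented):

def run_hooks(handlers, *args):
    # Core invokes every registered handler; a handler that blows up is
    # caught and logged so the rest of the page request still succeeds.
    for handler in handlers:
        try:
            handler(*args)
        except TypeError as err:
            # The page still renders, so nothing looks broken; only the
            # side effect (event logging) quietly stops.
            print("hook handler failed:", err)

def log_search_event(page, user):
    # Extension handler written against the *old* hook signature.
    print("logging search event for", page, "by", user)

# Core used to call the hook with two arguments, so events flowed normally:
run_hooks([log_search_event], "Main_Page", "SomeUser")

# Then core changed the hook to pass a third argument. The extension handler
# was not updated, so it now fails on every call and events stop reaching
# the logging tables, which is how the breakage was eventually noticed.
run_hooks([log_search_event], "Main_Page", "SomeUser", {"extra": "context"})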
- As a higher level question, what are the thresholds for using a SWAT
deployment as opposed to the regular deployment train, are these standards
being followed, and are they the right standards. (Even I notice that most
of the big problems seem to come with deployments outside of the deployment
train.)

This is documented at https://wikitech.wikimedia.org/wiki/SWAT_deploys.
I'm not sure about previous outages, but in this case the patch matches the
documented limits. My intuition is that a dep
- How was the code reviewed and tested before deployment

Code was re
- Why did it appear to work in some contexts (indicated in your response as
master and Beta Labs) but not in the production context

Because, as stated in the report and by Brad, the code itself works. The
code was redeployed after the outage with no errors because the second time
it was deployed in the correct order. This is why code review didn't catch
the fatal and the error didn't show up in Beta Labs. This was an issue
primarily with the deployment process.
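To make the ordering point concrete, here is a rough sketch in Python (the
real code is PHP configuration in the CirrusSearch and WikimediaEvents
extensions; modelling the fatal as a duplicate definition is a
simplification, and all names are invented):

class WikiRuntime:
    # Toy stand-in for the production configuration; names are invented.
    def __init__(self):
        self.definitions = {}

    def define(self, name, owner):
        if name in self.definitions:
            # Two extensions providing the same definition at once is fatal.
            raise RuntimeError("fatal: %r already defined by %s"
                               % (name, self.definitions[name]))
        self.definitions[name] = owner

def request(cirrus_has_schema, events_has_schema):
    # Simulate one page request with the two repositories in a given state.
    runtime = WikiRuntime()
    if cirrus_has_schema:
        runtime.define("ExampleSearchSchema", "CirrusSearch")
    if events_has_schema:
        runtime.define("ExampleSearchSchema", "WikimediaEvents")
    return "ok"

# Backport applied in the wrong order: WikimediaEvents adds the schema while
# the CirrusSearch removal has not been deployed yet, so every request
# fatals until one of the two patches is rolled back.
try:
    request(cirrus_has_schema=True, events_has_schema=True)
except RuntimeError as fatal:
    print(fatal)

# Correct order (and effectively what master and Beta Labs saw): the
# CirrusSearch removal lands first, then WikimediaEvents adds the schema.
print(request(cirrus_has_schema=False, events_has_schema=True))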
- How are we ensuring that deployments that require multiple sequential
steps are (a) identified and (b) implemented in a way that those steps are
followed in the correct order
Notice how none of the questions are "what was wrong with the code" or "who
screwed up". They're all systems questions. This is a systems problem. Even
in situations where there *is* a problem with the code or someone *did*
screw up, the root cause usually comes back to having single points of
failure (e.g. one person having the ability to [unintentionally] get
problem code deployed, or weaknesses in the code review and testing
process).
Risker/Anne
At a higher level, this was a 9 minute outage instead of a 2 or 3 minute
outage due to two mistakes I made while doing the revert. Both of these are
in the incident report. First, the monitor I was watching on our log
server, which should have told me a rollback was needed, did not report
this error, adding a minute or two before the rollback started. We have
other monitors that have been added in the past year that I should have
been looking at as well. Second, I reverted multiple patches from within
Gerrit (our code review tool), which takes too long when the site is down.
I can only point to inexperience here; others who have previously taken our
sites down informed me that the proper way is to revert directly on the
deployment server. I've been deploying patches to wmf for a couple of years
and have always reverted through Gerrit in the past, but those reverts
didn't need the extra-speedy recovery because the site was not down; it was
only logging errors, or some specific piece of functionality was not
working.
Going up another level brings us to our deployment tooling specifically.
RelEng is working on a project called scap3 which brings our deployment
process closer to what you should expect from a top 10 website. It includes
canary deployments (e.g. 1% of servers) along with a single command that
undoes the entire deployment. Canary deployments allow us to see an error
before it is deployed everywhere, and a one-command rollback operation
would likely have brought the site back 3 to 4 minutes faster than how I
reverted the patches.
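As a rough illustration of the canary idea (this is not scap3's actual
interface; the names and thresholds below are invented): push to a small
slice of the fleet, check the error monitors, and either continue or roll
back in a single step.

import random

def deploy_to(servers, version):
    print("deployed", version, "to", len(servers), "servers")

def error_rate(servers):
    # Stand-in for checking the fatal monitors after pushing to the canaries.
    return random.random()

def rollback(servers, previous_version):
    # The single-command rollback: put every touched server back in one step.
    print("rolled back", len(servers), "servers to", previous_version)

def canary_deploy(servers, new_version, previous_version, threshold=0.01):
    # Push to roughly 1% of the fleet first and watch the error rate there.
    canaries = servers[: max(1, len(servers) // 100)]
    deploy_to(canaries, new_version)
    if error_rate(canaries) > threshold:
        rollback(canaries, previous_version)
        return False
    deploy_to(servers, new_version)
    return True

canary_deploy(["mw%d" % i for i in range(300)], "new-version", "old-version")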
I did not link the scap3 portions as an actionable because, in my mind,
that's not a single actionable thing. Scap3 is a major overhaul of our
deploy process. Additionally, this is already a priority in RelEng.