Re: [Wikitech-l] Wikipedia is down

27 Oct 2015

On Tue, Oct 27, 2015 at 10:29 AM, Risker <risker.wp(a)gmail.com> wrote:

...
     - Why wasn't it part of the deployment train

Good question, and one that needs someone involved in this backport to
answer.

...
     - As a higher level question, what are the thresholds for using a SWAT
       deployment as opposed to the regular deployment train, are these
       standards being followed, and are they the right standards. (Even I
       notice that most of the big problems seem to come with deployments
       outside of the deployment train.)

My understanding is that SWAT is supposed to be for WMF configuration
changes (i.e. the operations/mediawiki-config repo, which this wasn't) and
for urgent bug fixes that can't wait for the weekly train. But my
understanding might be too strict, so I'd recommend waiting for a more
official answer than mine.

...
     - How was the code reviewed and tested before deployment

First, it was reviewed before being merged into master. Then the SWAT
deployer is supposed to review the backport for potential issues, although
they may lack the domain-specific knowledge that the original reviewers
have to spot issues like the one here.

...
     - Why did it appear to work in some contexts (indicated in your response
       as master and Beta Labs) but not in the production context

You're assuming this code wouldn't have worked in the production context if
deployed correctly. It's like asking "Why does it work to change a
lightbulb normally, but it doesn't work if the bulb-changer forgets to
remove the burned-out bulb before trying to put the new one in?"

...
     - How are we ensuring that deployments that require multiple sequential
       steps are (a) identified and (b) implemented in a way that those steps
       are followed in the correct order

It requires that the people proposing/implementing the change identify the
prerequisites. There's currently no automated way to do this, and even if
some automated mechanism such as "Depends-On" tags on the git commits were
implemented, it would require that people use the mechanism correctly and
that the mechanism can be tracked automatically during backports as well as
normal development merges.
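
To make the "Depends-On" idea concrete, here is a minimal sketch of what
such a check could look like. It is not existing tooling: the footer format,
the function names, and the change IDs are all hypothetical, and it assumes
each change carries a stable ID that survives a backport.

```python
import re
from typing import Dict, List

# Hypothetical footer format: "Depends-On: <change id>" on a line of its own.
DEPENDS_ON = re.compile(r"^Depends-On:[ \t]*(\S+)[ \t]*$", re.MULTILINE | re.IGNORECASE)


def parse_depends_on(commit_message: str) -> List[str]:
    """Return the change IDs named in Depends-On footers of a commit message."""
    return DEPENDS_ON.findall(commit_message)


def missing_prerequisites(
    deploy_order: List[str], messages: Dict[str, str]
) -> Dict[str, List[str]]:
    """Map each change to the prerequisites that were not deployed before it.

    deploy_order: change IDs in the order they would be deployed.
    messages: change ID -> full commit message (both hypothetical inputs).
    """
    seen = set()
    problems: Dict[str, List[str]] = {}
    for change in deploy_order:
        deps = parse_depends_on(messages.get(change, ""))
        missing = [dep for dep in deps if dep not in seen]
        if missing:
            problems[change] = missing
        seen.add(change)
    return problems


if __name__ == "__main__":
    messages = {
        "Iaaa": "Add new hook\n",
        "Ibbb": "Use new hook\n\nDepends-On: Iaaa\n",
    }
    # Backporting only the dependent change gets flagged.
    print(missing_prerequisites(["Ibbb"], messages))
```

A check like this only helps if the footer was written in the first place,
which is the human part that no mechanism removes.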

There's also the possibility that unit testing could catch such issues when
the changes are merged to the deployment branches before being deployed,
and our Release Engineering team has been working on increasing the number
of extension unit tests run. But that requires we have unit tests that
cover everything, which we don't, so things can still slip through. It also
wouldn't handle the case where the individual files of the change are
individually deployed out of order, although at a glance it doesn't seem
like that was the issue here.

Taking this further to discuss plans, implementation, and mitigation of the
remaining process issues is a discussion for the Release Engineering team,
and may already be happening somewhere. Once people in SF get into work
they might have further comments along these lines.

-- 
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation
