[QA] [Ops] security patches handling

Wed Jan 25 19:20:32 UTC 2017

Hi! Replies inline...

On Wed, Jan 25, 2017 at 10:59 AM Stas Malyshev <smalyshev at wikimedia.org>
wrote:

> Hi!
>
> Yesterday we've had search broken on mediawiki.org for some time, which
> is seemingly caused by a security patch not being applied correctly to
> production code. I am not sure what chain of events led to it (and
> please feel free to correct me if I misunderstood anything) but it makes
> me a bit worried, for the following reasons:
>
> 1. We had broken code which is not in gerrit or actually anywhere in
> repos running in production.
>
>
Half-true. The code is not in gerrit but it *is* committed locally to the
deployment branches on the repos. They're always rebased on top
so they're immediately visible in the git log.

> 2. As far as I know, none of people knowing the actual code that was
> broken knew there were problems with applying the patch (even though it
> has .failed as filename in filesystem, so I wonder what happened there)
> or that this particular piece of code was substantially different in
> production than in repos. Of course we've figured it out but it was not
> immediately known AFAIK.
>
>
Yes, we did know there were problems applying these patches. This current
set of security patches have been particularly problematic in rebasing each
week for the train. As discussed on IRC last night, the problem comes in
trying to sort out the merge conflicts. Most of the time we get it right and
nobody notices, but the context here wasn't great and we made a mistake.

The *.failed was just a lack of cleanup after resolving the conflicts...the
patches *were* applied.

> 3. We knew about the problem only after stumbling on it by chance on
> mediawiki.org. It could as well be less visible but much more critical
> and would be deployed further by production train before we noticed it.
>
>
Testing in beta/labs is great, but security patches by definition won't be
there.

That's the reason for group0 deployments on Tuesday and wider deployments
on Wed/Thurs. There is a group0 dashboard in Logstash that *everyone*
doing the train conducting should be watching. New things always need
triage,
if not resolution prior to rolling out further.

Had we not gotten a fix in place yesterday, I would've halted the train for
wider deployment.

> 4. We have out-of-repo code living on our production long term, while
> code underneath it changes, and developers of the code being patched are
> not aware of this going on. These changes may as well have invalidated
> the security patch instead of producing visible error and we wouldn't know.
>
>
Yep, this has always been a risk and has put a ton of cognitive load on
the people running the train.

> I understand the challenges with providing and applying security patches
> for code that is both public and production under heavy use.  But the
> failure scenario we've had is rather worrying for me and I wonder if we
> could do better in handling security patches to lower the probability of
> this happening in the future.
>
>
Yes, there are major problems with how we're doing this now.

> I don't know enough about security patches handling process to propose
> anything, but I hope that people who do would help with this, so I'd
> like to initiate the discussion about it.
>
>
So, the security patch process is basically thus:
0) Security patches are stored in /srv/patches/$VERSION/ on deployment
masters (tin/mira)
1) When the new branch is cut and checked out to the deployment master
on tuesday, patches are copied from the previous version in /srv/patches/
2) Apply them and pray they don't conflict
3) Rebase patches if needed, save updated versions to /srv/patches/
4) Sync code live

This falls apart in (3), which requires the deployer to patch code with
which
they very likely are unfamiliar. We (mostly Mukunda) have been working on
scap functionality to automate steps 1-2, but 3 is always going to happen.

To help the situation, I think a few things are necessary:
- We (me, mostly) need to *drastically* cut down the time it takes to get
security releases public. Carrying non-master fixes in production should
be the exception, not the norm. This also helps us keep non-production
wikis safer (I'm looking at you, beta)
- Using merges/shared git history between deployment branches instead
of patchfiles would probably simplify a lot of this, needs further thinking
through though
- A *clear* owner for any security patches in production (without having
to dig through Phabricator). This person must be consulted on any
non-trivial
conflict, imho.

We need to do better...

-Chad
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.wikimedia.org/pipermail/qa/attachments/20170125/7dc2e15b/attachment.html>