[Engineering] [Ops] Canary Deploys for MediaWiki

Mon Jul 25 20:54:49 UTC 2016

Note to deployers: when syncing certain config changes (e.g. adding a new
variable) that touch both InitialiseSettings and CommonSettings, you will
now need to use sync-dir wmf-config, because individual sync-files will
likely fail if the intermediate state throws notices/errors.

(It was a good idea to do this before, but it'll be more strongly enforced
now.)

On Jul 25, 2016 12:35, "Tyler Cipriani" <tcipriani at wikimedia.org> wrote:

> tl;dr: Scap will deploy to canary servers and check for error-log spikes
> in the next version (to be released Soon™).
>
> In light of recent incidents[0] that caused outages accompanied by large,
> easily detectable error-rate spikes, a patch has recently landed in
> Scap[1] that will:
>
>    1. Push changes to a set of canary servers[2] before syncing to proxy
>       servers
>    2. Wait a configurable length of time (currently 20 seconds[3]) for any
>       errors to have time to make themselves known
>    3. Query Logstash (using a script written by Gabriel Wicke[4]) to
>       determine if the error rate has increased over a configurable
>       threshold (currently 10-fold[5])
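
The three steps above can be sketched roughly as follows. This is only an
illustration of the flow, not Scap's actual code: the function names, the
`push` and `error_count` callables, and the handling of a zero baseline are
all assumptions made for the example. Only the 20-second wait and the
10-fold threshold come from the announcement.

```python
import time
from typing import Callable, Sequence

CANARY_WAIT_SECONDS = 20      # wait quoted in the announcement
ERROR_RATE_THRESHOLD = 10.0   # "10-fold" threshold quoted in the announcement


def error_rate_spiked(before: float, after: float,
                      threshold: float = ERROR_RATE_THRESHOLD) -> bool:
    """Return True if `after` is more than `threshold` times the baseline."""
    # Assumption: treat a baseline below 1 as 1, so that a completely quiet
    # period does not turn a single new error into a "spike".
    return after > threshold * max(before, 1.0)


def canary_deploy(canaries: Sequence[str],
                  push: Callable[[str], None],
                  error_count: Callable[[Sequence[str]], float],
                  wait: float = CANARY_WAIT_SECONDS,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Push to canaries, wait, and report whether it looks safe to continue.

    `push` and `error_count` are hypothetical hooks: the first syncs one
    host, the second stands in for the Logstash query.
    """
    baseline = error_count(canaries)   # error count before the push
    for host in canaries:              # step 1: canaries first
        push(host)
    sleep(wait)                        # step 2: give errors time to appear
    current = error_count(canaries)    # step 3: query the error count again
    return not error_rate_spiked(baseline, current)
```

A caller would sync the remaining servers only when `canary_deploy(...)`
returns True, and stop the deploy otherwise.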
>
> Big thanks to the folks that helped in this effort: Gabriel Wicke, Filippo
> Giunchedi and Giuseppe Lavagetto, Bryan Davis and Erik Bernhardson (for
> their mad Logstash skillz)!
>
> Note that when expedience is required (we're in the middle of an outage
> and who cares what Logstash has to say), the `--force` flag can be added
> to skip canary checks altogether (e.g.
> `scap sync-file --force wmf-config/InitialiseSettings 'Panic!!'`).
>
> The RelEng team's eventual goal is still to move MediaWiki deployments to
> the more robust and resilient Scap3 deployment framework. There is some
> high-priority work that has to happen before the Scap3 move. In the
> interim, we are taking steps (like this one) to respond to incidents and
> keep deployments safe.
>
> Hopefully, this work and the error-rate alert work from Ori last week[6]
> will allow everyone to be more conscientious and more keenly aware of
> deployments that cause large aberrations in the rate of errors.
>
> <3,
> Your Friendly Neighborhood Release Engineering Team
>
> [0].
> https://wikitech.wikimedia.org/wiki/Incident_documentation/20160601-MediaWiki
> is the most recent example I could find, but there have been others.
> [1]. https://phabricator.wikimedia.org/D248
> [2]. https://gerrit.wikimedia.org/r/#/c/294742/
> [3]. https://github.com/wikimedia/scap/blob/master/scap/config.py#L19
> [4]. https://gerrit.wikimedia.org/r/#/c/292505/
> [5]. https://github.com/wikimedia/scap/blob/master/scap/config.py#L18
> [6]. https://gerrit.wikimedia.org/r/#/c/300327/