tl;dr: Scap will deploy to canary servers and check for error-log spikes in the next
version (to be released Soon™).
In light of recent incidents[0] that caused outages accompanied by large, easily
detectable error-rate spikes, a patch has recently landed in Scap[1] that will:
1. Push changes to a set of canary servers[2] before syncing to proxy servers
2. Wait a configurable length of time (currently 20 seconds[3]) for any errors to have
time to make themselves known
3. Query Logstash (using a script written by Gabriel Wicke[4]) to determine if the
error rate has increased over a configurable threshold (currently 10-fold[5])
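
For anyone who wants the three steps above in code form, here is a rough sketch. It is not
Scap's actual implementation: the function names, the `sync` and `error_rate` callables, and
the before/after comparison are all assumptions made for illustration. Only the 20-second
wait and the 10-fold threshold come from the defaults mentioned above.

```python
import time
from typing import Callable, Sequence

# Defaults mirror the values quoted in this announcement; the names are hypothetical.
CANARY_WAIT_SECONDS = 20
CANARY_THRESHOLD = 10.0


def error_rate_spiked(before: float, after: float,
                      threshold: float = CANARY_THRESHOLD) -> bool:
    """Return True if the error rate grew by more than `threshold`-fold."""
    # Assumption: clamp the baseline to 1.0 so a silent baseline
    # neither divides by zero nor flags a handful of errors as a spike.
    baseline = max(before, 1.0)
    return after / baseline > threshold


def canary_deploy(canaries: Sequence[str],
                  sync: Callable[[str], None],
                  error_rate: Callable[[Sequence[str]], float],
                  wait_seconds: float = CANARY_WAIT_SECONDS,
                  threshold: float = CANARY_THRESHOLD,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Push to canaries, wait, then report whether it looks safe to continue."""
    before = error_rate(canaries)      # baseline, e.g. from a Logstash query
    for host in canaries:              # step 1: push to the canary servers
        sync(host)
    sleep(wait_seconds)                # step 2: give errors time to show up
    after = error_rate(canaries)       # step 3: query the error rate again
    return not error_rate_spiked(before, after, threshold)


if __name__ == "__main__":
    # Fake error rates: 2 errors before the push, 50 after (a 25-fold jump).
    rates = iter([2.0, 50.0])
    ok = canary_deploy(["canary1", "canary2"],
                       sync=lambda host: None,
                       error_rate=lambda hosts: next(rates),
                       sleep=lambda seconds: None)
    print("proceed" if ok else "abort")  # prints "abort"
```

Passing `sync`, `error_rate`, and `sleep` in as arguments keeps the sketch runnable without
real servers or a Logstash instance; a real check would plug in its own versions.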
Big thanks to the folks who helped in this effort: Gabriel Wicke, Filippo Giunchedi, and
Giuseppe Lavagetto, plus Bryan Davis and Erik Bernhardson (for their mad Logstash skillz)!
Note that when expedience is required (we're in the middle of an outage and who cares
what Logstash has to say), the `--force` flag can be added to skip canary checks
altogether, e.g.
`scap sync-file --force wmf-config/InitialiseSettings 'Panic!!'`.
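
As a rough illustration of how a `--force` style escape hatch can be wired up, here is a
small `argparse` sketch. It is not Scap's real argument handling: the parser, the
`sync-file-sketch` program name, and `should_run_canary_checks` are hypothetical, and only
the `--force` flag and the example arguments come from the command above.

```python
import argparse
from typing import Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser shaped like the sync-file example (hypothetical)."""
    parser = argparse.ArgumentParser(prog="sync-file-sketch")
    parser.add_argument("--force", action="store_true",
                        help="skip canary checks")
    parser.add_argument("file", help="path of the file to sync")
    parser.add_argument("message", help="log message for the sync")
    return parser


def should_run_canary_checks(argv: Sequence[str]) -> bool:
    """Canary checks run unless --force was passed."""
    args = build_parser().parse_args(argv)
    return not args.force


if __name__ == "__main__":
    argv = ["--force", "wmf-config/InitialiseSettings", "Panic!!"]
    print(should_run_canary_checks(argv))  # prints False
```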
The RelEng team's eventual goal is still to move MediaWiki deployments to the more
robust and resilient Scap3 deployment framework. There is some high-priority work that
has to happen before the Scap3 move. In the interim, we are taking steps (like this one)
to respond to incidents and keep deployments safe.
Hopefully, this work and the error-rate alert work from Ori last week[6] will allow
everyone to be more conscientious and more keenly aware of deployments that cause large
aberrations in the rate of errors.
<3,
Your Friendly Neighborhood Release Engineering Team
[0]. https://wikitech.wikimedia.org/wiki/Incident_documentation/20160601-MediaWi… is the
most recent example I could find, but there have been others.
[1]. https://phabricator.wikimedia.org/D248
[2]. https://gerrit.wikimedia.org/r/#/c/294742/
[3]. https://github.com/wikimedia/scap/blob/master/scap/config.py#L19
[4]. https://gerrit.wikimedia.org/r/#/c/292505/
[5]. https://github.com/wikimedia/scap/blob/master/scap/config.py#L18
[6]. https://gerrit.wikimedia.org/r/#/c/300327/