<p dir="ltr">Note to deployers: when syncing certain config changes (e.g. adding a new variable) that touch both InitialiseSettings and CommonSettings, you will now need to use sync-dir wmf-config, because individual sync-files will likely fail if the intermediate state throws notices/errors.</p>
<p dir="ltr">(It was a good idea to do this before, but it'll be more strongly enforced now.)</p>
<div class="gmail_extra"><br><div class="gmail_quote">On Jul 25, 2016 12:35, "Tyler Cipriani" &lt;<a href="mailto:tcipriani@wikimedia.org">tcipriani@wikimedia.org</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">tl;dr: Scap will deploy to canary servers and check for error-log spikes in the next version (to be released Soon™).<br>
<br>
In light of recent incidents[0] which have created outages accompanied by large, easily detectable, error-rate spikes, a patch has recently landed in Scap[1] that will:<br>
<br>
1. Push changes to a set of canary servers[2] before syncing to proxy servers<br>
2. Wait a configurable length of time (currently 20 seconds[3]) to give any errors time to surface<br>
3. Query Logstash (using a script written by Gabriel Wicke[4]) to determine if the error rate has increased over a configurable threshold (currently 10-fold[5])<br>
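<br>
A minimal sketch of the three steps above, in Python. This is not Scap's actual code: the function names, the injected callables, the before/after comparison, and the floor of 1.0 on the baseline are all assumptions made for illustration. Only the defaults (20 seconds, 10-fold) and the skip-on-force behaviour come from this message.<br>
<pre>
```python
import time
from typing import Callable


def error_rate_spiked(before: float, after: float, threshold: float = 10.0) -> bool:
    """Return True if `after` is at least `threshold` times the baseline.

    Assumption: the baseline is floored at 1.0 so that a silent baseline
    does not turn a single stray error into a spike.
    """
    return after >= threshold * max(before, 1.0)


def sync_with_canaries(
    push_canaries: Callable[[], None],
    push_all: Callable[[], None],
    get_error_rate: Callable[[], float],
    wait_seconds: float = 20.0,
    threshold: float = 10.0,
    force: bool = False,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Hypothetical canary-gated sync; returns True if the full sync ran."""
    if force:
        # Assumption: forcing skips the canary checks and syncs everywhere.
        push_all()
        return True

    before = get_error_rate()   # baseline, e.g. from a Logstash query
    push_canaries()             # step 1: canaries first
    sleep(wait_seconds)         # step 2: give errors time to show up
    after = get_error_rate()    # step 3: re-query and compare

    if error_rate_spiked(before, after, threshold):
        return False            # abort before reaching the other servers

    push_all()
    return True
```
</pre>
The pushes and the error-rate query are passed in as callables so that the gating logic can be exercised without touching real servers or Logstash.<br>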
<br>
Big thanks to the folks who helped in this effort: Gabriel Wicke, Filippo Giunchedi, Giuseppe Lavagetto, and Bryan Davis and Erik Bernhardson (for their mad Logstash skillz)!<br>
<br>
Note that when expedience is required (we're in the middle of an outage and who cares what Logstash has to say), the `--force` flag can be added to skip canary checks altogether (e.g. `scap sync-file --force wmf-config/InitialiseSettings 'Panic!!'`).<br>
<br>
The RelEng team's eventual goal is still to move MediaWiki deployments to the more robust and resilient Scap3 deployment framework. There is some high-priority work that has to happen before the Scap3 move. In the interim, we are taking steps (like this one) to respond to incidents and keep deployments safe.<br>
<br>
Hopefully, this work and the error-rate alert work from Ori last week[6] will allow everyone to be more conscientious and more keenly aware of deployments that cause large aberrations in the rate of errors.<br>
<br>
<3,<br>
Your Friendly Neighborhood Release Engineering Team<br>
<br>
[0]. <a href="https://wikitech.wikimedia.org/wiki/Incident_documentation/20160601-MediaWiki" rel="noreferrer" target="_blank">https://wikitech.wikimedia.org/wiki/Incident_documentation/20160601-MediaWiki</a> is the most recent example I could find, but there have been others.<br>
[1]. <a href="https://phabricator.wikimedia.org/D248" rel="noreferrer" target="_blank">https://phabricator.wikimedia.org/D248</a><br>
[2]. <a href="https://gerrit.wikimedia.org/r/#/c/294742/" rel="noreferrer" target="_blank">https://gerrit.wikimedia.org/r/#/c/294742/</a><br>
[3]. <a href="https://github.com/wikimedia/scap/blob/master/scap/config.py#L19" rel="noreferrer" target="_blank">https://github.com/wikimedia/scap/blob/master/scap/config.py#L19</a><br>
[4]. <a href="https://gerrit.wikimedia.org/r/#/c/292505/" rel="noreferrer" target="_blank">https://gerrit.wikimedia.org/r/#/c/292505/</a><br>
[5]. <a href="https://github.com/wikimedia/scap/blob/master/scap/config.py#L18" rel="noreferrer" target="_blank">https://github.com/wikimedia/scap/blob/master/scap/config.py#L18</a><br>
[6]. <a href="https://gerrit.wikimedia.org/r/#/c/300327/" rel="noreferrer" target="_blank">https://gerrit.wikimedia.org/r/#/c/300327/</a><br>
</blockquote></div></div>