If the intermediate state throws notices/errors, wouldn't it be a better
idea to sync-file in the correct order to prevent such notices/errors?
On 25 July 2016 at 21:54, Roan Kattouw <roan.kattouw(a)gmail.com> wrote:
Note to deployers: when syncing certain config changes (e.g. adding a new
variable) that touch both InitialiseSettings and CommonSettings, you will
now need to use sync-dir wmf-config, because individual sync-files will
likely fail if the intermediate state throws notices/errors.
(It was a good idea to do this before, but it'll be more strongly enforced
now.)
On Jul 25, 2016 12:35, "Tyler Cipriani" <tcipriani(a)wikimedia.org> wrote:
tl;dr: Scap will deploy to canary servers and check for error-log spikes
in the next version (to be released Soon™).
In light of recent incidents[0] which have created outages accompanied by
large, easily detectable, error-rate spikes, a patch has recently landed in
Scap[1] that will:
1. Push changes to a set of canary servers[2] before syncing to proxy
servers
2. Wait a configurable length of time (currently 20 seconds[3]) for
any errors to have time to make themselves known
3. Query Logstash (using a script written by Gabriel Wicke[4]) to
determine if the error rate has increased over a configurable threshold
(currently 10-fold[5])
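
The three steps above can be sketched roughly as follows. This is only an
illustration of the flow, not the code that landed in Scap: the function
names, the callables passed in, and the choice to floor a zero baseline at
1.0 are all assumptions made for the example. Only the 20-second wait and
the 10-fold threshold come from the description above.

```python
import time
from typing import Callable


def error_rate_spiked(before: float, after: float, threshold: float = 10.0) -> bool:
    """Return True if `after` is more than `threshold` times `before`."""
    # Assumption: floor the baseline at 1.0 so that a completely quiet
    # baseline does not flag a single stray error as an infinite increase.
    baseline = max(before, 1.0)
    return after / baseline > threshold


def canary_check(
    sync_canaries: Callable[[], None],      # hypothetical: pushes the change to canaries
    fetch_error_rate: Callable[[], float],  # hypothetical: asks Logstash for the current rate
    wait_seconds: float = 20.0,
    threshold: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if it looks safe to continue syncing to the proxy servers."""
    before = fetch_error_rate()   # baseline before the change goes out
    sync_canaries()               # step 1: push to the canary servers first
    sleep(wait_seconds)           # step 2: give errors time to show up
    after = fetch_error_rate()    # step 3: compare against the baseline
    return not error_rate_spiked(before, after, threshold)
```

A caller would abort the deployment when `canary_check(...)` returns False,
and would skip the call entirely when a force option is given.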
Big thanks to the folks that helped in this effort: Gabriel Wicke,
Filippo Giunchedi and Giuseppe Lavagetto, Bryan Davis and Erik Bernhardson
(for their mad Logstash skillz)!
Note that when expedience is required (we're in the middle of an outage
and who cares what Logstash has to say), the `--force` flag can be added to
skip canary checks altogether (e.g.
`scap sync-file --force wmf-config/InitialiseSettings 'Panic!!'`).
The RelEng team's eventual goal is still to move MediaWiki deployments to
the more robust and resilient Scap3 deployment framework. There is some
high-priority work that has to happen before the Scap3 move. In the
interim, we are taking steps (like this one) to respond to incidents and
keep deployments safe.
Hopefully, this work and the error-rate alert work from Ori last week[6]
will allow everyone to be more conscientious and more keenly aware of
deployments that cause large aberrations in the rate of errors.
<3,
Your Friendly Neighborhood Release Engineering Team
[0] https://wikitech.wikimedia.org/wiki/Incident_documentation/20160601-MediaWi…
    is the most recent example I could find, but there have been others.
[1] https://phabricator.wikimedia.org/D248
[2] https://gerrit.wikimedia.org/r/#/c/294742/
[3] https://github.com/wikimedia/scap/blob/master/scap/config.py#L19
[4] https://gerrit.wikimedia.org/r/#/c/292505/
[5] https://github.com/wikimedia/scap/blob/master/scap/config.py#L18
[6] https://gerrit.wikimedia.org/r/#/c/300327/
_______________________________________________
Ops mailing list
Ops(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ops