Hello!
TL;DR: Our recent Elasticsearch cluster restart did not go as planned. The most important lesson learned: we did not understand the recovery settings correctly.
Yesterday, we did a cold restart of the Elasticsearch / cirrus eqiad cluster. This restart did not go as planned. There was no user-facing impact, since we had moved all traffic to codfw before the restart. It did impact logstash (more on that in a separate report).
Incident documentation: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170920-Elastics...
Have fun!
Guillaume