We have recently experienced multiple instabilities of our Elasticsearch cluster in codfw. The first one was identified on 2016-04-27 around 10:00 UTC, the second on 2016-05-02 around 23:00 UTC.
In both cases the symptoms were similar:
* a cluster restart was in progress to modify the cluster discovery strategy (moving from multicast to unicast [1])
* cluster-wide operations (getting the list of nodes or shards, changing cluster settings) were extremely slow
* the number of pending tasks (`curl -s localhost:9200/_cluster/health?pretty | jq .number_of_pending_tasks`) was high
* most of those tasks were shard deletions (`curl -s localhost:9200/_cat/pending_tasks | grep indices_store | wc -l`)
* a few of them were deletions and creations of the "mediawiki_cirrussearch_frozen_indexes" index
* response time of client requests did not seem to be affected
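For reference, a minimal sketch of the checks used to assess cluster state (run against a local node; host and port are assumed to be the default localhost:9200):

```
# Pending cluster-state task count, from the cluster health API
curl -s localhost:9200/_cluster/health?pretty | jq .number_of_pending_tasks

# Full list of pending cluster-state tasks
curl -s localhost:9200/_cat/pending_tasks

# How many of those are shard-store deletions
curl -s localhost:9200/_cat/pending_tasks | grep indices_store | wc -l
```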
Only "more like" traffic is going to codfw. On the 27, we switched this traffic to eqiad to buy time for investigation and recovery [2]. We found a copy/paste error in the cluster discovery configuration [3]. After a full cluster restart the situation stabilized.
On the 2nd, we investigated further by capturing traffic and looking for the cause of the deletions of the "mediawiki_cirrussearch_frozen_indexes" index. We saw a high number of deletion and creation requests for this index, coming from the MediaWiki job runners.
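The capture itself can be as simple as the sketch below (interface and port are assumptions about a default Elasticsearch HTTP setup, not the exact command we ran):

```
# Dump Elasticsearch HTTP traffic and keep only the request lines
sudo tcpdump -i any -A -s 0 'tcp port 9200' | grep -E '^(DELETE|PUT|POST) /'
```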
Further investigation showed that Elastica (the library used by CirrusSearch to communicate with Elasticsearch) does a recreate (deletion followed by creation) when creating a new index [4]. This was mitigated quickly by disabling index creation [5][6]. A more permanent fix is tracked in [7].
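At the HTTP level, such a recreate is roughly equivalent to the two calls below (a sketch of the Elasticsearch index API, not the exact requests Elastica sends). Each call is a cluster-state change that must be processed by the master and acknowledged by every node, which is why a stream of recreates piles up as pending tasks:

```
# Recreate = delete the index, then create it again
curl -s -XDELETE localhost:9200/mediawiki_cirrussearch_frozen_indexes
curl -s -XPUT localhost:9200/mediawiki_cirrussearch_frozen_indexes
```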
We suspect that this issue was seen only on codfw because the higher latency increases the probability of a race condition between two recreate operations.
Lessons learned:
* index creation / deletion can bring the cluster to its knees
* despite that, Elasticsearch is robust: client requests do not seem to have been affected
* a rising number of pending tasks seems to be a good indicator of cluster issues [8]
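A hedged sketch of what a pending-tasks check could look like (the threshold and the alerting mechanism are assumptions for illustration; the actual monitoring work is tracked in [8]):

```
#!/bin/sh
# Warn when the number of pending cluster-state tasks exceeds a threshold
THRESHOLD=100  # arbitrary example value
PENDING=$(curl -s localhost:9200/_cluster/health | jq .number_of_pending_tasks)
if [ "$PENDING" -gt "$THRESHOLD" ]; then
    echo "WARNING: $PENDING pending cluster tasks (threshold $THRESHOLD)"
    exit 1
fi
```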
Big thanks to David and Erik for their support in this issue!
[1] https://phabricator.wikimedia.org/T110236
[2] https://gerrit.wikimedia.org/r/#/c/285620/
[3] https://gerrit.wikimedia.org/r/#/c/285612/
[4] https://github.com/ruflin/Elastica/blob/master/lib/Elastica/Index.php#L238 - not entirely sure about this reference
[5] https://gerrit.wikimedia.org/r/#/c/286541/
[6] https://gerrit.wikimedia.org/r/#/c/286542/
[7] https://phabricator.wikimedia.org/T133793
[8] https://phabricator.wikimedia.org/T134240