We have recently experienced multiple instabilities of our
Elasticsearch cluster in codfw. The first was identified around 10:00
UTC on 2016-04-27, the second around 23:00 UTC on 2016-05-02.
In both cases the symptoms were similar:
* a cluster restart was in progress to modify cluster discovery
strategy (moving from multicast to unicast [1]).
* cluster wide operations (get list of nodes, shards, changing cluster
settings) were extremely slow
* number of pending tasks (`curl -s
localhost:9200/_cluster/health?pretty | jq .number_of_pending_tasks`)
was high
* most of those tasks were deletion of shards (`curl -s
localhost:9200/_cat/pending_tasks | grep indices_store | wc -l`)
* a few of them were deletion and creation of the
"mediawiki_cirrussearch_frozen_indexes" index
* response time of client requests did not seem to be affected
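The pending-tasks checks above can be sketched in Python against
illustrative data (a minimal sketch; the sample health response and
task lines below are made up for illustration, not captured from the
actual incident):

```python
import json

# Illustrative excerpt of a _cluster/health response (invented values,
# not real incident data).
health_response = json.loads("""
{
  "cluster_name": "example-codfw",
  "status": "green",
  "number_of_pending_tasks": 4
}
""")

# Illustrative lines in the style of _cat/pending_tasks output; the
# shard-deletion tasks contain "indices_store" in their source.
pending_tasks_lines = [
    "1234 12s URGENT indices_store",
    "1235 10s NORMAL create-index [mediawiki_cirrussearch_frozen_indexes]",
    "1236  9s URGENT indices_store",
]

# Equivalent of: jq .number_of_pending_tasks
pending = health_response["number_of_pending_tasks"]
# Equivalent of: grep indices_store | wc -l
shard_deletions = sum(1 for line in pending_tasks_lines
                      if "indices_store" in line)

print(pending)          # total pending cluster-state tasks
print(shard_deletions)  # how many are shard deletions
```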
Only "more like" traffic is going to codfw. On the 27, we switched
this traffic to eqiad to buy time for investigation and recovery [2].
We found a copy/paste error in the cluster discovery configuration
[3]. After a full cluster restart the situation stabilized.
On the 2nd, we did further investigation by capturing traffic and
looking for the cause of the deletion of the
"mediawiki_cirrussearch_frozen_indexes" index. We saw a high number of
deletion and creation requests for this index, coming from mediawiki
job runners.
Further investigation showed that Elastica (the library used by
CirrusSearch to communicate with Elasticsearch) does a recreate
(deletion followed by creation) to create a new index [4]. This was
fixed quickly by disabling index creation [5][6]. A more permanent fix
is tracked [7].
We suspect that this issue was seen only on codfw because higher
latency increases the probability of a race condition between two
re-create operations.
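The churn can be illustrated with a toy model (a sketch only; the
class and function names below are invented for illustration and are
not Elastica's or Elasticsearch's actual API). Each unconditional
re-create issues a delete plus a create, and both are cluster-state
changes that queue as pending tasks, even when the index already
exists; a create-only-if-missing approach is an idempotent no-op:

```python
# Toy model of cluster-state index operations (hypothetical names).
class Cluster:
    def __init__(self):
        self.indexes = set()
        self.ops = []  # each logged op stands in for a pending cluster task

    def delete_index(self, name):
        if name in self.indexes:
            self.indexes.discard(name)
            self.ops.append(("delete", name))

    def create_index(self, name):
        if name not in self.indexes:
            self.indexes.add(name)
            self.ops.append(("create", name))

def recreate(cluster, name):
    # The problematic pattern: unconditional delete followed by create.
    cluster.delete_index(name)
    cluster.create_index(name)

def ensure(cluster, name):
    # Idempotent alternative: create only if the index is missing.
    if name not in cluster.indexes:
        cluster.create_index(name)

c = Cluster()
c.create_index("frozen_indexes")
c.ops.clear()

# Two job runners both "re-creating" an index that already exists:
recreate(c, "frozen_indexes")
recreate(c, "frozen_indexes")
recreate_churn = len(c.ops)
print(recreate_churn)  # 4 cluster-state changes for what should be a no-op

c.ops.clear()
ensure(c, "frozen_indexes")
ensure(c, "frozen_indexes")
ensure_churn = len(c.ops)
print(ensure_churn)  # 0
```

With many job runners and higher request latency, the window during
which two such delete/create pairs can overlap grows, which is why we
suspect the problem surfaced in codfw first.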
Lessons learned:
* index creation / deletion can bring the cluster to its knees
* despite that, Elasticsearch is robust: client requests do not seem
to have been affected
* a rising number of pending tasks seems to be a good indicator of
cluster issues [8]
Big thanks to David and Erik for their support in this issue!
[1] https://phabricator.wikimedia.org/T110236
[2] https://gerrit.wikimedia.org/r/#/c/285620/
[3] https://gerrit.wikimedia.org/r/#/c/285612/
[4] https://github.com/ruflin/Elastica/blob/master/lib/Elastica/Index.php#L238 (not entirely sure about this reference)
[5] https://gerrit.wikimedia.org/r/#/c/286541/
[6] https://gerrit.wikimedia.org/r/#/c/286542/
[7] https://phabricator.wikimedia.org/T133793
[8] https://phabricator.wikimedia.org/T134240
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation