Hello all!
Usually, the main problem with an elasticsearch cluster restart is that it is a long and boring operation. This week, I ran into a number of mostly minor issues. Probably not enough for an incident report, but enough to warrant an email. So here is some info on what went wrong:
First, my apologies for the noise caused by this restart! I'd love to tell you that this won't happen again, but sometimes things go wrong...
1) master re-election took longer than expected (both on codfw and on eqiad)
When a master is restarted, a re-election must occur and the cluster state must be re-synchronized across the cluster. During that period of time, both read and write operations can fail. Usually, this is short enough to be mostly transparent and it is not picked up by our monitoring (this was the first time that the issue was visible). The synchronization of the cluster state took 1.9 minutes, which, while not a crazy duration, is definitely too long.
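For the curious, here is a rough sketch of how a re-election can be watched from the outside. This is not our actual monitoring, just an illustration, assuming a cluster reachable on localhost:9200 and the Python requests library:

# Poll which node is currently elected master; during an election the request
# can fail or time out, which we treat as "no master".
import time
import requests

ES = "http://localhost:9200"

def current_master():
    try:
        r = requests.get(ES + "/_cat/master?format=json", timeout=5)
        r.raise_for_status()
        rows = r.json()
        return rows[0]["node"] if rows else None
    except requests.RequestException:
        return None

previous = current_master()
while True:
    master = current_master()
    if master != previous:
        print("master changed: {} -> {}".format(previous, master))
        previous = master
    time.sleep(1)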
It would be great to be able to preemptively trigger a smooth master re-election, but that's not something that elasticsearch supports, despite multiple requests.
Potential solutions:
* have dedicated masters, which do not serve data at all: this could ensure more resources are available for master / cluster wide operations, but would probably not reduce the time for synchronization of cluster state
* reduce the size of the cluster state: this would imply reducing the number of indices, probably by splitting the cluster into multiple independent clusters, with additional routing and general maintenance burden.
2) indexing lag had an impact on Wikidata
All (well, most, there are a few exceptions) writes to elasticsearch are asynchronous and go through the job queue. During maintenance, we freeze writes, restart nodes, wait for at least partial recovery, re-enable writes, wait for full recovery, and move on to the next group of nodes. Recovering shards which have been written to means transferring the full shard over the network, which is much slower than recovering from local storage (note that ES6 should enable partial shard recovery, making this issue much easier to deal with).
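To make that sequence a bit more concrete, here is a very rough sketch of the per-batch dance. The freeze/thaw and restart steps are placeholders for our own tooling, the allocation toggling is the standard rolling-restart trick from the elasticsearch documentation (not necessarily exactly what our scripts do), the host names are made up, and it assumes a cluster on localhost:9200 plus the Python requests library:

import requests

ES = "http://localhost:9200"

def set_allocation(mode):
    # Only allocate primaries while nodes are bouncing, so replicas are not
    # shuffled around the cluster; "all" restores normal behaviour.
    r = requests.put(ES + "/_cluster/settings", json={
        "transient": {"cluster.routing.allocation.enable": mode}
    }, timeout=10)
    r.raise_for_status()

def wait_for_status(status):
    # Blocks until the cluster reaches at least the requested status.
    r = requests.get(ES + "/_cluster/health",
                     params={"wait_for_status": status, "timeout": "30m"},
                     timeout=None)
    r.raise_for_status()

def freeze_writes():
    print("freezing writes")       # placeholder: done on the MediaWiki side

def thaw_writes():
    print("re-enabling writes")    # placeholder: done on the MediaWiki side

def restart_node(node):
    print("restarting " + node)    # placeholder: our usual server tooling

def restart_batch(nodes):
    freeze_writes()
    set_allocation("primaries")
    for node in nodes:
        restart_node(node)
    wait_for_status("yellow")      # at least partial recovery
    thaw_writes()
    set_allocation("all")
    wait_for_status("green")       # full recovery before the next batch

restart_batch(["node-1", "node-2", "node-3"])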
For most wikis, having lag on the indexing is not an issue. A new page taking some time to be searchable is OK. An update to a page is even less of an issue, since in most cases an edit will affect the ranking, which is already a heuristic.
Wikidata has a very different workflow, where editors add statements (or properties, or items, I'm not fluent in the Wikidata terminology) in sequence, referencing the just-created statements. Searching for those statements immediately is part of the natural Wikidata workflow.
Potential solutions:
* writes are always going to be asynchronous, with potential lag
* we have a patch [1] coming up to trigger synchronous updates in the nominal case (with still the appropriate asynchronous failover if need be), but that is not going to help in case of planned maintenance
* better feedback in the UI, warning the user that the lag is higher than usual, would be nice
* I should better communicate to the Wikidata community when doing maintenance
3) unassigned shard check was triggered
We have an Icinga check that raises an alert when more than 10% of shards are unassigned. When a node is down, the shards it hosted go unassigned until they are re-assigned, either to the same node once it is back up, or to another node in the cluster.
I reboot elasticsearch servers in groups of 3. On a 36-node cluster, this usually means that less than 10% of the shards are unassigned at any time. But shards are not perfectly balanced across nodes: other considerations, like the size of each shard, are also taken into account. We also had one fewer node in the cluster (elastic1021 has memory issues [2]).
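As a back-of-the-envelope illustration of the numbers involved (not the actual Icinga check, just a sketch assuming a cluster on localhost:9200 and the Python requests library):

import requests

ES = "http://localhost:9200"

health = requests.get(ES + "/_cluster/health", timeout=10).json()
total = (health["active_shards"]
         + health["relocating_shards"]
         + health["initializing_shards"]
         + health["unassigned_shards"])
pct = 100.0 * health["unassigned_shards"] / total

# With perfectly balanced shards, taking down 3 nodes out of 36 would leave
# roughly 3/36 ~= 8.3% of shard copies unassigned, below the 10% threshold;
# imbalance plus one node already out of service is enough to cross it.
print("unassigned: {:.1f}% ({}/{})".format(
    pct, health["unassigned_shards"], total))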
This check is a heuristic: it is hard to know at which point we are really in trouble. In this case, the cluster was still fine, and enough shards were recovered to silence the check in a few minutes.
Potential solutions:
* not much... we might want to raise the threshold to 12% instead of 10%, but it is likely that we would still get a false positive at some point (note that this is the first false positive from this check in the 2 years I've been here, which is not too bad).
If you are still reading, thanks a lot for your patience and interest!
Note that the cluster restart is still ongoing on eqiad. I'm not planning on anything else going wrong, but who knows... I might have an addendum to this email before the weekend.
Have fun!
Guillaume
[1] https://gerrit.wikimedia.org/r/#/c/413492/
[2] https://phabricator.wikimedia.org/T188595