Elasticsearch cluster restart more eventful than expected - Discovery

1 Mar 2018

Hello all!

Usually the main problem in an elasticsearch cluster restart is that
it is a long and boring operation. This week, I ran into a number of
mostly minor issues. Probably not enough for an incident report, but
at least an email is needed. So here is some info on what went wrong:

First, my apologies for the noise caused by this restart! I'd love to
tell you that this won't happen again, but sometimes things go
wrong...

1) master re-election took longer than expected (both on codfw and on eqiad)

When a master is restarted, a re-election must occur and the cluster
state must be re-synchronized across the cluster. During that period
of time, both read and write operations can fail. Usually, this is
short enough to be mostly transparent and it is not picked up by our
monitoring (this was the first time that the issue was visible). The
synchronisation of cluster state took 1.9 minutes which, while not a
crazy duration, is definitely too long.

It would be great to be able to preemptively trigger a smooth master
re-election, but that's not something that elasticsearch supports,
despite multiple requests.

Potential solutions:

* have dedicated masters, which do not serve data at all: this could
ensure more resources are available for master / cluster wide
operations, but would probably not reduce the time for synchronization
of cluster state
* reduce the size of the cluster state: this would imply reducing the
number of indices, probably by splitting the cluster into multiple
independent clusters, at the cost of additional routing and general
maintenance burden.

2) indexing lag had impact on Wikidata

All (well, most, there are a few exceptions) writes to elasticsearch
are asynchronous and go through the job queue. During maintenance, we
freeze writes, restart nodes, wait for at least partial recovery,
re-enable writes, wait for full recovery, move to the next nodes.
Recovering shards which have been written to means transferring the
full shard over the network, which is much slower than recovering from
local storage (note that ES6 should enable partial shard recovery
making this issue much easier to deal with).
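
To make the sequence above concrete, here is a minimal Python sketch
of the loop. It is only an illustration, not the script actually used:
the step names, the run_step callback and the node names are all
hypothetical, and the group size is just a parameter.

```python
from typing import Callable, List, Sequence

# Order of operations for one group of nodes, as described above.
# The step names are hypothetical labels, not real commands.
STEPS = (
    "freeze_writes",
    "restart_nodes",
    "wait_for_partial_recovery",
    "enable_writes",
    "wait_for_full_recovery",
)


def batches(nodes: Sequence[str], size: int) -> List[List[str]]:
    """Split the node list into consecutive groups of at most `size` nodes."""
    if size < 1:
        raise ValueError("group size must be at least 1")
    return [list(nodes[i:i + size]) for i in range(0, len(nodes), size)]


def rolling_restart(
    nodes: Sequence[str],
    group_size: int,
    run_step: Callable[[str, List[str]], None],
) -> None:
    """Run every step, in order, for each group before moving to the next.

    `run_step` is a caller-supplied (hypothetical) callback that performs
    the actual work for a given step and group of nodes.
    """
    for group in batches(nodes, group_size):
        for step in STEPS:
            run_step(step, group)


if __name__ == "__main__":
    # Dry run: print the plan for a made-up 7 node cluster.
    fake_nodes = [f"node{i}" for i in range(7)]
    rolling_restart(fake_nodes, 3, lambda step, group: print(step, group))
```

The point of the ordering is that writes are only re-enabled after at
least partial recovery, and that any shard written to before full
recovery is what later needs the slow copy over the network.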

For most wikis, having lag on the indexing is not an issue. A new page
taking some time to be searchable is OK. An update to a page is even
less of an issue, since in most cases an edit will affect the ranking,
which is already a heuristic.

Wikidata has a very different workflow, where editors add statements
(or properties or items, I'm not fluent in the Wikidata terminology)
in sequence, referencing the just created statements. Searching for
those statements immediately is part of the natural wikidata workflow.

Potential solutions:

* writes are always going to be asynchronous, with potential lag
* we have a patch [1] coming up to trigger synchronous updates in the
nominal case (while keeping the appropriate asynchronous failover if
need be), but that is not going to help in case of planned maintenance.
* better feedback in the UI, warning the user that the lag is higher
than usual, would be nice
* I should communicate better with the Wikidata community when doing maintenance

3) unassigned shard check was triggered

We have an Icinga check that raises an alert when more than 10% of
shards are unassigned. When a node is down, the shards that it hosted
go unassigned, until the shards are re-assigned, either on the same
node once it is back up, or moved to another node in the cluster.

I reboot elasticsearch servers in groups of 3. On a 36-node cluster,
this usually means that less than 10% of the shards go unassigned. But
the shards are not perfectly balanced: other considerations, like the
size of those shards, are taken into account. We also had one less node
in the cluster (elastic1021 has memory issues [2]).

This check is a heuristic. It is hard to know at which point we are
really in trouble. In this case, the cluster was still fine, and
enough shards were recovered to silence the check in a few minutes.
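
As a rough illustration of why the check fired, here is a small Python
sketch of the arithmetic. I'm assuming the check simply compares
unassigned / total shards with a threshold; the real Icinga check may
compute this differently, and the shard counts below are made up.

```python
def unassigned_fraction(unassigned: int, total: int) -> float:
    """Fraction of shards that are currently unassigned."""
    if total <= 0:
        raise ValueError("total shard count must be positive")
    return unassigned / total


def check_fires(unassigned: int, total: int, threshold: float = 0.10) -> bool:
    """True when strictly more than `threshold` of the shards are unassigned."""
    return unassigned_fraction(unassigned, total) > threshold


if __name__ == "__main__":
    # Perfectly balanced shards: the fraction lost equals the fraction
    # of nodes restarted, which stays under 10% in both cases.
    print(f"3 of 36 nodes: {unassigned_fraction(3, 36):.1%}")
    print(f"3 of 35 nodes: {unassigned_fraction(3, 35):.1%}")

    # Hypothetical imbalance: 3500 shards on 35 nodes (100 per node on
    # average), but the 3 restarted nodes happen to hold 120 each.
    print("alert at 10%:", check_fires(360, 3500))
    print("alert at 12%:", check_fires(360, 3500, threshold=0.12))
```

With balanced shards, losing 3 nodes stays below the threshold whether
the cluster has 36 or 35 nodes. It only takes a modest imbalance on the
restarted nodes to cross 10%, which is why the margin is thin.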

Potential solutions:

* not much... we might want to raise this check to 12% instead of 10%,
but it is likely that we will still get a false positive at some point
(note that this is the first time in the 2 years I've been here that
this check has sent a false positive, not too bad).

If you are still reading, thanks a lot for your patience and interest!

Note that the cluster restart is still ongoing on eqiad. I'm not
planning on anything else going wrong, but who knows... I might have
an addendum to this email before the weekend.

Have fun!

   Guillaume

[1] https://gerrit.wikimedia.org/r/#/c/413492/
[2] https://phabricator.wikimedia.org/T188595

-- 
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST