Elasticsearch is finally starting to implement adaptive replica selection, where nodes choose which replica to send a request to based on how long requests to that replica take compared to the others. Sadly, this doesn't look like it will make it into ES 6 (at least not the initial release). Currently the replica that serves a query is chosen randomly, so a node under heavy load has no way to shed pressure.
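For reference, adaptive replica selection is expected to be a dynamic cluster setting; a minimal sketch of the settings body that would toggle it (the setting name is taken from the ES 6 documentation, and whether it ships under exactly that name is an assumption):

```python
import json

# Sketch: request body for PUT /_cluster/settings that would enable
# adaptive replica selection (setting name per the ES 6 docs; treat
# it as an assumption until the feature actually ships).
settings = {
    "transient": {
        "cluster.routing.use_adaptive_replica_selection": True
    }
}
body = json.dumps(settings)
print(body)
```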

On Fri, Jun 30, 2017 at 10:27 AM, Guillaume Lederrey <glederrey@wikimedia.org> wrote:
On Fri, Jun 30, 2017 at 7:13 PM, Guillaume Lederrey
<glederrey@wikimedia.org> wrote:
> Hello!
>
> We've had a significant slowdown of elasticsearch today (see Grafana
> for exact timing [1]). The impact was low enough that it probably does
> not require a full incident report (the number of errors did not rise
> significantly [2]), but understanding what happened and sharing that
> understanding is important. This is going to be a long and technical
> email, you might get bored, feel free to close it and delete it right
> now.
>
> TL;DR: elastic1019 was overloaded, having too many heavy shards,
> banning all shards from elastic1019 to reshuffle allowed it to
> recover.
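The "banning" step in the TL;DR maps to Elasticsearch's allocation filtering; a minimal sketch of the settings body it implies (the node name comes from this incident, the setting is the standard cluster settings API):

```python
import json

# Sketch of the recovery step from the TL;DR: excluding a node by
# name from shard allocation forces all of its shards to relocate to
# other nodes. Sent as the body of PUT /_cluster/settings; resetting
# the value to null later lets shards move back.
ban = {
    "transient": {
        "cluster.routing.allocation.exclude._name": "elastic1019"
    }
}
print(json.dumps(ban))
```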
>
> In more detail:
>
> elastic1019 was hosting shards for commonswiki, enwiki and frwiki,
> which are all high-load shards. elastic1019 is one of our older
> servers, which are less powerful, and might also suffer from CPU
> overheating [3].
>
> The obvious question: "why do we even allow multiple heavy shards to
> be allocated on the same node?". The answer is obvious as well: "it's
> complicated...".
>
> One of the very interesting features of elasticsearch is its ability
> to automatically balance shards. This allows the cluster to
> automatically rebalance when nodes are lost, and to spread resource
> usage across all nodes in the cluster [4]. Constraints can be added
> to account for available disk space [5], rack awareness [6], or even
> specific filtering for specific indices [7]. It does not, however,
> allow constraining allocation based on the load of a specific shard.
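To make those constraints concrete, here is a sketch of two of them as settings bodies (the awareness attribute "row" and the per-node cap of 2 are illustrative values, not our exact production config):

```python
import json

# Sketch: two of the allocation constraints mentioned above.
# Cluster-level rack/row awareness (attribute name illustrative):
awareness = {
    "persistent": {
        "cluster.routing.allocation.awareness.attributes": "row"
    }
}
# Per-index cap on how many shards of one index may sit on a single
# node (value illustrative) -- the closest built-in knob to "don't
# stack heavy shards on one node":
index_cap = {
    "index.routing.allocation.total_shards_per_node": 2
}
print(json.dumps(awareness))
print(json.dumps(index_cap))
```

Note that both constraints are per attribute or per index; neither looks at how much load an individual shard actually receives.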
>
> We do have a few mechanisms to ensure that load is as uniform as
> possible across the cluster:
>
> An index is split in multiple shards, a shard is replicated multiple
> times to provide redundancy and to spread load. Those are configured
> by index.
>
> We know which indices are heavy (commons, enwiki, frwiki, ...), both
> in terms of size and in terms of traffic. Those indices are split
> into a number of shards + replicas close to the number of nodes in
> the cluster, to ensure that those shards are spread evenly, with
> only a few shards of the same index on the same node, while still
> allowing us to lose a few nodes and keep all shards allocated. For
> example, enwiki_content has 8 shards, with 2 replicas each, for a
> total of 24 shards, with a maximum of 2 shards on the same node.
> This approach works well most of the time.

Did you wonder how 8 shards * 2 replicas makes 24 total shards? Well,
it is actually 8 shards * (1 primary + 2 replicas). And now it makes
sense!
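In code, the arithmetic from that aside:

```python
# Total shard copies for an index: each primary shard plus its
# replicas counts as one copy to allocate somewhere on the cluster.
def total_shard_copies(primary_shards, replicas_per_shard):
    return primary_shards * (1 + replicas_per_shard)

# enwiki_content: 8 primary shards, 2 replicas each.
print(total_shard_copies(8, 2))  # -> 24
```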

> The limitation is that a shard is a "scalability unit": you can't
> move around anything smaller than a shard. In the case of enwiki, a
> single shard is ~40GB and serves a fairly large number of requests
> per second. If a node has just one more of those shards than its
> peers, that's already a significant amount of additional load.
>
> The solution could be to split large indices into many more shards:
> the scalability unit would be much smaller, and it would be much
> easier to achieve a uniform load. Of course, there are limitations
> here too. The total number of shards in the cluster has a
> significant cost: increasing it adds load to cluster operations
> (which are already quite expensive with the total number of shards
> we have at this point). There are also functional issues: ranking
> (BM25) uses statistics calculated per shard, and with smaller shards
> the stats might at some point no longer be representative of the
> whole corpus.
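To make the per-shard statistics point concrete, here is a sketch using Lucene's BM25 IDF formula with made-up document counts; the takeaway is that two shards with different local statistics can weight the same term differently, which skews ranking when shards get small:

```python
import math

# Lucene's BM25 IDF: log(1 + (N - n + 0.5) / (n + 0.5)), where N is
# the shard's document count and n the number of its docs containing
# the term. Both counts below are made up for illustration.
def bm25_idf(doc_count, docs_with_term):
    return math.log(1 + (doc_count - docs_with_term + 0.5)
                    / (docs_with_term + 0.5))

big_shard = bm25_idf(1_000_000, 10_000)  # term in ~1% of docs
small_shard = bm25_idf(50_000, 100)      # same term, rarer locally
print(big_shard, small_shard)  # different weights for the same term
```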
>
> There are probably a lot more details we could get into; feel free to
> ask more questions and we can continue the conversation. And I'm sure
> David and Erik have a lot to add!
>
> Thanks for reading to the end!
>
>    Guillaume
>
>
>
>
>
> [1] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&panelId=20&fullscreen&from=1498827816290&to=1498835616852
> [2] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&panelId=9&fullscreen&from=1498827816290&to=1498835616852
> [3] https://phabricator.wikimedia.org/T168816
> [4] https://www.elastic.co/guide/en/elasticsearch/reference/current/shards-allocation.html
> [5] https://www.elastic.co/guide/en/elasticsearch/reference/current/disk-allocator.html
> [6] https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-awareness.html
> [7] https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-allocation-filtering.html
>
>
> --
> Guillaume Lederrey
> Operations Engineer, Discovery
> Wikimedia Foundation
> UTC+2 / CEST



--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST

_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery