Did you wonder how 8 shards * 2 replicas makes 24 total shards?

On Fri, Jun 30, 2017 at 7:13 PM, Guillaume Lederrey
<glederrey@wikimedia.org> wrote:
> Hello!
>
> We've had a significant slowdown of elasticsearch today (see Grafana
> for exact timing [1]). The impact was low enough that it probably does
> not require a full incident report (the number of errors did not rise
> significantly [2]), but understanding what happened and sharing that
> understanding is important. This is going to be a long and technical
> email; you might get bored, so feel free to close it and delete it
> right now.
>
> TL;DR: elastic1019 was overloaded, having too many heavy shards,
> banning all shards from elastic1019 to reshuffle allowed it to
> recover.
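For the record, the "ban" is just a shard allocation exclusion pushed as a
transient cluster setting. A rough sketch in Python of what that looks like
(the URL and the exact node-name pattern are illustrative, not the exact
command we ran):

    import requests

    # Ask the allocator to move all shards away from elastic1019;
    # setting the exclusion back to "" later lifts the ban.
    requests.put(
        "http://localhost:9200/_cluster/settings",
        json={"transient": {
            "cluster.routing.allocation.exclude._name": "elastic1019*"
        }},
    )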
>
> In more detail:
>
> elastic1019 was hosting shards for commonswiki, enwiki and frwiki,
> which are all high-load shards. elastic1019 is one of our older
> servers, which are less powerful and might also suffer from CPU
> overheating [3].
>
> The obvious question: "why do we even allow multiple heavy shards to
> be allocated on the same node?". The answer is obvious as well: "it's
> complicated...".
>
> One of the very interesting features of elasticsearch is its ability
> to automatically balance shards. This allows the cluster to rebalance
> automatically when nodes are lost, and to spread resource usage across
> all nodes in the cluster [4]. Constraints can be added to account for
> available disk space [5], rack awareness [6], or even to apply
> allocation filtering to specific indices [7]. It does not, however,
> directly allow constraining allocation based on the load of a specific
> shard.
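To make the rack awareness part [6] concrete, it boils down to one cluster
setting plus a matching attribute on every node; a minimal sketch (the
attribute name "row" and the URL are assumptions for illustration):

    import requests

    # Spread the copies of each shard across values of the "row" attribute
    # (each node also needs the attribute set, e.g. node.attr.row: A1).
    requests.put(
        "http://localhost:9200/_cluster/settings",
        json={"persistent": {
            "cluster.routing.allocation.awareness.attributes": "row"
        }},
    )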
>
> We do have a few mechanisms to ensure that load is as uniform as
> possible on the cluster:
>
> An index is split into multiple shards, and each shard is replicated
> multiple times to provide redundancy and to spread load. Both numbers
> are configured per index.
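Concretely, both knobs live in the index settings at creation time; a minimal
sketch (index name, URL and numbers are illustrative, the numbers just mirror
the enwiki_content example below):

    import requests

    # The primary shard count is fixed when the index is created;
    # the replica count can be changed afterwards.
    requests.put(
        "http://localhost:9200/some_index",
        json={"settings": {
            "index.number_of_shards": 8,
            "index.number_of_replicas": 2,
        }},
    )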
>
> We know which are the heavy indices (commons, enwiki, frwiki, ...),
> both in terms of size and in terms of traffic. Those indices are split
> into a number of shards+replicas close to the number of nodes in the
> cluster, to ensure that those shards are spread evenly across the
> cluster, with only a few shards of the same index on the same node,
> while still allowing us to lose a few nodes and keep all shards
> allocated. For example, enwiki_content has 8 shards, with 2 replicas
> each, so a total of 24 shards, with a maximum of 2 shards on the same
> node. This approach works well most of the time.
Well, it is actually 8 shards * (1 primary + 2 replicas). And now it
makes sense!
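Or, in shard arithmetic (using the enwiki_content numbers from above):

    primaries = 8                       # index.number_of_shards
    replicas = 2                        # index.number_of_replicas, per primary
    total = primaries * (1 + replicas)  # 24 shard copies to place on the cluster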
> The limitation is that a shard is a "scalability unit": you can't move
> around anything smaller than a shard. In the case of enwiki, a single
> shard is ~40GB and serves a fairly large number of requests per
> second. If a node hosts just one more of those shards, that's already
> a significant amount of additional load.
>
> The solution could be to split large indices into a lot more shards:
> the scalability unit would be much smaller, and it would be much
> easier to keep the load uniform. Of course, there are also
> limitations. The total number of shards in the cluster has a
> significant cost. Increasing it will add load to cluster operations
> (which are already quite expensive with the total number of shards we
> have at this point). There are also functional issues: ranking (BM25)
> uses statistics calculated per shard, and with smaller shards the
> stats might at some point no longer be representative of the whole
> corpus.
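As a side note, the cluster-wide shard count mentioned above is easy to
eyeball from the health endpoint; a quick sketch (the URL is an assumption,
any node in the cluster answers):

    import requests

    health = requests.get("http://localhost:9200/_cluster/health").json()
    # active_primary_shards counts primaries only, active_shards includes replicas
    print(health["active_primary_shards"], health["active_shards"])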
>
> There are probably a lot more details we could get into; feel free to
> ask more questions and we can continue the conversation. And I'm sure
> David and Erik have a lot to add!
>
> Thanks for reading to the end!
>
> Guillaume
>
>
>
>
>
> [1] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&panelId=20&fullscreen&from=1498827816290&to=1498835616852
> [2] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?orgId=1&panelId=9&fullscreen&from=1498827816290&to=1498835616852
> [3] https://phabricator.wikimedia.org/T168816
> [4] https://www.elastic.co/guide/en/elasticsearch/reference/current/shards-allocation.html
> [5] https://www.elastic.co/guide/en/elasticsearch/reference/current/disk-allocator.html
> [6] https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-awareness.html
> [7] https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-allocation-filtering.html
>
>
> --
> Guillaume Lederrey
> Operations Engineer, Discovery
> Wikimedia Foundation
> UTC+2 / CEST
--
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST
_______________________________________________
discovery mailing list
discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery