Hello all!
We had an interesting discussion yesterday with David about the way we shard our indices in Elasticsearch. Here are a few notes for whoever finds the subject interesting and wants to jump into the discussion:
Context:
We recently activated row-aware shard allocation on our Elasticsearch search clusters. This means we now have one additional constraint on shard allocation: copies of a shard must be spread across multiple datacenter rows, so that if we lose a full row, we still have a copy of all the data. During an upgrade of Elasticsearch, another constraint comes into play: a shard can move from a node with an older version of Elasticsearch to a node with a newer version, but not the other way around. This led to Elasticsearch struggling to allocate all shards during the recent codfw upgrade to Elasticsearch 2.3.5. While it is not the end of the world (we can still serve traffic if some indices don't have all shards allocated), this is something we need to improve.
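For reference, here is a minimal sketch of what enabling this kind of row-aware allocation looks like; the attribute name "row" and the localhost endpoint are illustrative assumptions, not our exact production config:

import requests

# Assumption: each node declares its row in elasticsearch.yml, e.g. `node.row: A5`.
# The cluster is then told to spread copies of each shard across values of that attribute.
resp = requests.put(
    "http://localhost:9200/_cluster/settings",
    json={"persistent": {
        "cluster.routing.allocation.awareness.attributes": "row",
    }},
)
print(resp.json())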
Number of shards / number of replicas:
An Elasticsearch index is split at creation into a number of shards, and a number of replicas per shard is configured [1]. The total number of shards for an index is "number_of_shards * (number_of_replicas + 1)". Increasing the number of shards per index allows a read operation to be executed in parallel over the different shards, with the results aggregated at the end, improving response time. Increasing the number of replicas allows the read load to be distributed over more nodes (and provides some redundancy in case we lose one server). Note that term frequency [2] is calculated over a shard and not over the full index, so shard size also has an impact on scoring.
There is some black magic involved in how we shard our indices, but most of it is documented [3].
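To make the arithmetic concrete, here is a rough sketch of creating an index with explicit shard and replica counts; the index name and the endpoint are placeholders for the example, not our production setup:

import requests

primaries, replicas = 6, 3
# Total shards = number_of_shards * (number_of_replicas + 1)
print(primaries * (replicas + 1))  # -> 24

# Hypothetical index creation; number_of_shards is fixed at creation time,
# number_of_replicas can be changed later on a live index.
requests.put(
    "http://localhost:9200/example_index",
    json={"settings": {
        "number_of_shards": primaries,
        "number_of_replicas": replicas,
    }},
)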
The enwiki_content example:
The enwiki_content index is configured with 6 shards and 3 replicas, for a total of 24 shards. It also has the additional constraint that there is at most 1 enwiki_content shard per node. This ensures a maximum spread of enwiki_content shards over the cluster. Since enwiki_content is one of the indices with the most traffic, this ensures that the load is well distributed over the cluster.
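That "at most 1 shard per node" constraint corresponds to the per-index total_shards_per_node allocation setting; roughly (a sketch with an assumed endpoint, not the exact command we run):

import requests

# Allow at most 1 shard (primary or replica) of enwiki_content on any single node.
requests.put(
    "http://localhost:9200/enwiki_content/_settings",
    json={"index.routing.allocation.total_shards_per_node": 1},
)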
Now the bad news: for codfw, which is a 24-node cluster, it means that reaching this perfect equilibrium of 1 shard per node is a serious challenge once you take the other constraints into account. Even after relaxing the constraint to 2 enwiki shards per node, we have seen unassigned shards during the Elasticsearch upgrade.
Potential improvements:
While ensuring that a large index has a number of shards close to the number of nodes in the cluster allows for optimally spreading load over the cluster, it degrades fast if all the stars are not aligned perfectly. There are 2 opposite solutions:
1) decrease the number of shards to leave some room to move them around
2) increase the number of shards and allow multiple shards of the same index to be allocated on the same node
1) is probably impractical for our large indices: enwiki_content shards are already ~30GB, which makes them slow to move around during relocation and recovery.
2) is probably our best bet. More, smaller shards mean that the load of a single query will be spread over more nodes, potentially improving response time. Increasing the number of shards for enwiki_content from 6 to 20 (total shards = 80) means we have 80 / 24 = 3.3 shards per node. Removing the 1 shard per node constraint and letting Elasticsearch spread the shards as best it can means that if 1 node is missing, or during an upgrade, we still have the ability to move shards around. Increasing this number even more might help keep the load evenly spread across the cluster (the difference between 8 and 9 shards per node is smaller than the difference between 3 and 4 shards per node).
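A quick back-of-the-envelope check of those numbers (plain arithmetic, nothing cluster-specific):

nodes = 24  # codfw

def shards_per_node(primaries, replicas):
    total = primaries * (replicas + 1)
    return total, total / nodes

print(shards_per_node(6, 3))   # current:  (24, 1.0)   -> exactly 1 shard per node
print(shards_per_node(20, 3))  # proposed: (80, 3.33...) -> ~3.3 shards per node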
David is going to do some tests to validate that those smaller shards don't impact the scoring (smaller shards mean term frequencies are computed over fewer documents, giving less reliable statistics).
I probably forgot a few points, but this email is more than long enough already...
Thanks to all of you who kept reading until the end!
MrG
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html#_shards_amp_replicas
[2] https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
[3] https://wikitech.wikimedia.org/wiki/Search#Estimating_the_number_of_shards_required
On 16/09/2016 at 16:28, Guillaume Lederrey wrote:
[...]
The enwiki_content example:
The enwiki_content index is configured with 6 shards and 3 replicas, for a total of 24 shards. It also has the additional constraint that there is at most 1 enwiki_content shard per node. This ensures a maximum spread of enwiki_content shards over the cluster. Since enwiki_content is one of the indices with the most traffic, this ensures that the load is well distributed over the cluster.
Side note: in the mediawiki config we updated the shard count to 6 for enwiki_content and set the replica count to 4 for eqiad. This is not effective yet since we haven't rebuilt the eqiad enwiki index. In short:
- eqiad effective settings for enwiki: 7*(3r+1p) => 28 shards
- eqiad future settings after a reindex: 6*(4r+1p) => 30 shards
- codfw settings for enwiki: 6*(3r+1p) => 24 shards
Now the bad news: for codfw, which is a 24-node cluster, it means that reaching this perfect equilibrium of 1 shard per node is a serious challenge once you take the other constraints into account. Even after relaxing the constraint to 2 enwiki shards per node, we have seen unassigned shards during the Elasticsearch upgrade.
Potential improvements:
While ensuring that a large index has a number of shards close to the number of nodes in the cluster allows for optimally spreading load over the cluster, it degrades fast if all the stars are not aligned perfectly. There are 2 opposite solutions:
1) decrease the number of shards to leave some room to move them around
2) increase the number of shards and allow multiple shards of the same index to be allocated on the same node
1) is probably impractical for our large indices: enwiki_content shards are already ~30GB, which makes them slow to move around during relocation and recovery.
I'm leaning towards 1. Our shards are very big, I agree, and it takes a non-negligible time to move them around. But we also noticed that the number of indices/shards is a source of pain for the master. I don't think we should reduce the number of primary shards, though; I'm more for reducing the number of replicas. Historically I think enwiki has been set to 3 replicas for performance reasons, not really for HA reasons. Now that we have moved all the prefix queries to a dedicated index, I'd be curious to see if we can serve fulltext queries for enwiki with only 2 replicas: 7*(2r+1p) => 21 shards total. I'd also be curious to see what the load would look like if we isolated autocomplete queries.
I think option 1 is more about how to trade HA vs shard count vs performance. Another option would be 10*(1r+1p) => 20 smaller shards, dividing by 2 the total size required to store enwiki. But then losing only 2 nodes could cause enwiki to go red (partial results), versus 3 nodes today.
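If we try that, the nice part is that number_of_replicas is dynamic, so dropping a replica doesn't require a reindex; something along these lines (sketch, endpoint assumed):

import requests

# number_of_replicas can be changed on a live index, unlike number_of_shards
# which is fixed at index creation (changing it requires a full reindex).
requests.put(
    "http://localhost:9200/enwiki_content/_settings",
    json={"index": {"number_of_replicas": 2}},
)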
2) is probably our best bet. More, smaller shards mean that the load of a single query will be spread over more nodes, potentially improving response time. Increasing the number of shards for enwiki_content from 6 to 20 (total shards = 80) means we have 80 / 24 = 3.3 shards per node. Removing the 1 shard per node constraint and letting Elasticsearch spread the shards as best it can means that if 1 node is missing, or during an upgrade, we still have the ability to move shards around. Increasing this number even more might help keep the load evenly spread across the cluster (the difference between 8 and 9 shards per node is smaller than the difference between 3 and 4 shards per node).
We should be cautious here concerning response times: there are steps in a Lucene query that do not really benefit from having more shards. Only collecting docs will really benefit; the rewrite (scanning the lexicon) and rescoring (sorting the top N and then rescoring) will add more work if done on more shards. But we can certainly reduce the rescore window with more primaries. Could we estimate how many shards per node we would have in the end with this strategy? Today we have ~370 shards/node on codfw vs ~300 for eqiad.
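For those following along: the rescore window is applied per shard, so with more primaries the same window rescores proportionally more documents overall, which is why we could shrink it. A hedged sketch of what such a query looks like; the query text, field name and window size are placeholders, not what CirrusSearch actually sends:

import requests

query = {
    "query": {"match": {"text": "some search terms"}},   # placeholder fulltext query
    "rescore": {
        "window_size": 512,                               # per-shard window, placeholder value
        "query": {
            "rescore_query": {"match_phrase": {"text": "some search terms"}},
        },
    },
}
resp = requests.post("http://localhost:9200/enwiki_content/_search", json=query)
print(resp.json())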
David is going to do some tests to validate that those smaller shards don't impact the scoring (smaller shards mean term frequencies are computed over fewer documents, giving less reliable statistics).
Yes, I'll try a 20-primary enwiki index and see how it works.
I probably forgot a few points, but this email is more than long enough already...
Thanks to all of you who kept reading until the end!
Thanks for writing it!
I did read to the end—and though I don't have anything useful to contribute, I appreciate the discussion.
One random thought that's not directly related: I wonder if people who aren't search/database aficionados would appreciate adding Elasticsearch, Lucene, CirrusSearch, nodes, shards, primaries and replicas to the Glossary. They are all things that can be looked up, but a quick description in the glossary could be helpful as a refresher. It's especially helpful when you know the concepts, but you've forgotten which name is which—which one's recall and which one's precision? Which is a shard and which is a node?
Thanks! —Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Fri, Sep 16, 2016 at 11:51 AM, David Causse dcausse@wikimedia.org wrote:
[...]