Hi,
TL;DR: If you think your Hive queries are currently taking longer than usual, please find qchris in IRC, and if he is not responsive, kindly ask someone with root on stat1002 (like Ops) to kill the process:
java -Dproc_balancer -Xmx1000m [...]
-----------------------------------------------------
Data in the Analytics cluster is not evenly distributed. Some data nodes are >90% full, while others are half empty.
Data nodes that are >90% full are considered unhealthy and no longer contribute to the pool of available resources. In particular, unhealthy data nodes no longer contribute to the total available memory in the cluster.
There are other motivations too, but the loss of memory alone is reason enough to keep the data nodes balanced and hence healthy.
Rebalancing has been running since 2015-02-26, but the situation has been getting worse faster than the rebalancer can correct it.
We've had up to 5 unhealthy nodes at once. Since we're missing their memory, I decided that we should rebalance more aggressively. Hence, I bumped the rebalancer's capacity, and nodes are recovering and getting healthy again.
I am monitoring the increased-capacity rebalancer closely, but in case you're getting blocked by it without me noticing, please find me in IRC and let me know, so I can turn the rebalancer's capacity down. Or if you find me unresponsive, please find someone with root on stat1002 (like Ops) and ask thon to kill the process
java -Dproc_balancer -Xmx1000m [...]
on stat1002.
Have fun, Christian