(If you don't have the ability to run jobs on our Hadoop/Hive cluster,
this will be pretty boring and you don't have to read it.)
Hey!
This is an (ir)regularly scheduled reminder that Christian wrote a
fantastic guide to what to do if the cluster is stalling - it lives at
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Load#What_to_d…
The options it gives you are:
1. Kill jobs that you own;
2. Ask other people to kill jobs they own;
3. Buy more servers.
Unless your name is Nuria or Otto, these are probably your only good
options; please do not kill other people's jobs, particularly jobs
marked "root.essential" run by "hdfs". These are (best-case) regularly
scheduled analysis and (worst-case) actual ETL and data consumption
tasks, which then have to be re-run.
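For option 1, here is a rough sketch of how you might pick out only your
own running jobs. It assumes the cluster is managed by YARN, that the
`yarn` command is on your path, and that the listing is tab-separated
with the user in the fourth column; none of that is stated above, and
the function names are made up for illustration.

```python
import getpass
import subprocess


def own_running_apps(listing, user):
    """Return the application IDs in a `yarn application -list` listing owned by `user`."""
    app_ids = []
    for line in listing.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        # Data rows start with an application ID; the owner is assumed to be column 4.
        if len(fields) >= 4 and fields[0].startswith("application_") and fields[3] == user:
            app_ids.append(fields[0])
    return app_ids


def print_kill_commands():
    """Print (not run) a kill command for each running job owned by the current user."""
    listing = subprocess.run(
        ["yarn", "application", "-list", "-appStates", "RUNNING"],
        capture_output=True, text=True, check=True,
    ).stdout
    for app_id in own_running_apps(listing, getpass.getuser()):
        # Only printing, so nothing gets killed until you have checked the list.
        print("yarn application -kill", app_id)
```

Calling print_kill_commands() on a machine with cluster access only
prints the commands; jobs owned by anyone else, "hdfs" included, never
make it into the list.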
If you notice the cluster is stalling and stopping your jobs doesn't
do anything, the #wikimedia-analytics IRC channel is probably your
best bet. On weekdays, posting a message there will probably be
enough. On weekends, address it to one of the analytics engineers.
If none of that works, the mailing lists are also good. And if it's
ultra-critical, physically poke or phone someone.
<eom>
--
Oliver Keyes
Research Analyst
Wikimedia Foundation