(If you don't have the ability to run jobs on our Hadoop/Hive cluster, this will be pretty boring and you don't have to read it)
Hey!
This is an (ir)regularly scheduled reminder that Christian wrote a fantastic guide to what to do if the cluster is stalling - it lives at https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Load#What_to_do...
The options it gives you are:
1. Kill jobs that you own; 2. Ask other people to kill jobs they own; 3. Buy more servers.
Unless your name is Nuria or Otto, these are probably your only good options; please do not kill other people's jobs, particularly jobs marked "root.essential" run by "hdfs". These are (best-case) regularly scheduled analysis and (worst-case) actual ETL and data consumption tasks, which then have to be re-run.
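For option 1, here is a rough sketch of how you might pick out only your own jobs before killing anything. It is not from Christian's guide. It assumes the cluster is managed through the stock `yarn application -list` and `yarn application -kill` commands, and that the listing is tab-separated with the application ID in the first column and the owning user in the fourth. The helper names `own_application_ids` and `kill_commands` are made up for illustration. It only builds the kill commands, so you can read them over before running any of them.

```python
import getpass


def own_application_ids(listing, user):
    """Return the IDs of applications in `listing` that belong to `user`.

    `listing` is the text printed by `yarn application -list`.
    Assumed column order: ID, name, type, user, queue, ...
    """
    ids = []
    for line in listing.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        # Data rows start with an application ID; header and summary lines do not.
        if len(fields) >= 4 and fields[0].startswith("application_") and fields[3] == user:
            ids.append(fields[0])
    return ids


def kill_commands(listing, user=None):
    """Build (but do not run) one kill command per application owned by `user`."""
    user = user or getpass.getuser()
    return [["yarn", "application", "-kill", app_id]
            for app_id in own_application_ids(listing, user)]


# Possible usage on a machine with the yarn client installed:
#   import subprocess
#   listing = subprocess.run(["yarn", "application", "-list"],
#                            capture_output=True, text=True, check=True).stdout
#   for cmd in kill_commands(listing):
#       print(" ".join(cmd))   # check the list, then run the ones you mean to kill
```

Because it filters on the user column, jobs run by "hdfs" never show up in the output unless you are actually running it as "hdfs".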
If you notice the cluster is stalling and stopping your own jobs doesn't help, the #wikimedia-analytics IRC channel is probably your best bet. On weekdays, dropping a message in the channel will probably be enough. On weekends, address it to one of the analytics engineers directly. If none of that works, the mailing lists are also good. And if it's ultra-critical, physically poke or phone someone.
<eom>