Hi,
regarding running jobs on the Analytics cluster, I've sometimes seen people say in IRC: “Let's run this heavy job. I'll keep an eye on it”.
But more often than not, this seems to have meant: “Let's just run this heavy job and wait. If QChris joins IRC, let's hope he doesn't ping us about having overloaded the cluster.”
That's not nice^Wscalable ;-)
So just in case someone is vague on how to “keep an eye on it”, I did a short write-up at:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Load
which explains how to tell, at a very high level, how the cluster is doing. In particular, it lets you detect whether the cluster has stalled and, if it has, tells you what to do.
Have fun, Christian
P.S.: The above URL has diagrams! Click the URL!