Thanks much, Christian, for the writeup.
Should we have Icinga alarms around these types of issues? Seems like that would be the way to go.
Thanks,
Nuria
On Sat, Mar 7, 2015 at 4:00 PM, Andrew Otto aotto@wikimedia.org wrote:
Thanks Christian!
On Mar 7, 2015, at 09:14, Christian Aistleitner <christian@quelltextlich.at> wrote:
Hi,
regarding running jobs on the Analytics cluster, I've sometimes seen people say on IRC: “Let's run this heavy job. I'll keep an eye on it”.
But more often than not, this seems to have meant: “Let's just run this heavy job and wait. If QChris joins IRC, let's hope he doesn't ping us about having overloaded the cluster.”
That's not nice^Wscalable ;-)
So just in case someone is vague on how to “keep an eye on it”, I did a short write-up at:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Load
which describes how to detect, at a very high level, how the cluster is doing. In particular, it lets you detect whether the cluster has stalled and, if it has, tells you what to do.
Have fun, Christian
P.S.: The above URL has diagrams! Click the URL!
--
---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3
4293 Gutau, Austria
Email: christian@quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics