Hi,
around running jobs on the Analytics cluster, I've sometimes seen people say in IRC: “Let's run this heavy job. I'll keep an eye on it”.
But more often than not, this seems to have meant: “Let's just run this heavy job and wait. If QChris joins IRC, let's hope he doesn't ping us about having overloaded the cluster.”
That's not nice^Wscalable ;-)
So just in case someone is vague on how to “keep an eye on it”, I did a short write-up at:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Load
which describes how to tell, at a very high level, how the cluster is doing. In particular, it lets you detect whether the cluster has stalled and, if it has, tells you what to do.
Have fun, Christian
P.S.: The above URL has diagrams! Click the URL!
Thanks Christian!
On Mar 7, 2015, at 09:14, Christian Aistleitner christian@quelltextlich.at wrote:
--
---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner, Kefermarkterstrasze 6a/3, 4293 Gutau, Austria
Email: christian@quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Thanks much Christian for the writeup.
Should we have Icinga alarms around these types of issues? Seems like that would be the way to go.
Thanks,
Nuria
> Should we have Icinga alarms around these types of issues? Seems like that would be the way to go.
Aside from this, I get daily emails about webrequest partition statuses, and I would at least notice the morning after that something is wrong.
> Aside from this, I get daily emails about webrequest partition statuses, and I would at least notice the morning after that something is wrong.
Right, but in the case of Friday that would mean perhaps having to backfill a bunch of data up to Saturday morning, whereas if we have alarms we can detect the issue right away and kill jobs as needed.
Chris, may I quote your email on BASH?
Pine
Hi Pine,
On Sat, Mar 07, 2015 at 08:15:18PM -0800, Pine W wrote:
> Chris, may I quote your email on BASH?
They take emails too?
Regardless ... feel free to quote or forward any of my emails wherever you see fit.
Have fun, Christian
> https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Load
Christian, may I move this page into the Cluster/Hadoop/Administration page?
> Should we have Icinga alarms around these types of issues? Seems like that would be the way to go.
We used to have Icinga alarms based on webrequest data existence in HDFS. They were very flaky due to the way we had to implement them. Hmm, I suppose we could try to use Graphite anomaly detection to alarm on the graph that QChris mentions.
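To make the anomaly-detection idea a bit more concrete, here is a minimal sketch in Python. It is not an existing alarm and not a Graphite feature: the function name, the window size, and the threshold are hypothetical, and it only assumes the `[value, timestamp]` pair layout that Graphite's JSON render output uses for datapoints. It flags the newest value when it lies far outside the recent history, which is roughly what a sudden stall would look like on a graph.

```python
from statistics import mean, stdev


def is_anomalous(datapoints, window=12, threshold=3.0):
    """Return True if the newest value deviates strongly from recent history.

    datapoints: list of [value, timestamp] pairs (assumed layout); entries
    whose value is None are skipped.
    window and threshold are illustrative defaults, not tuned values.
    """
    values = [value for value, _timestamp in datapoints if value is not None]
    if len(values) < window + 1:
        return False  # not enough history to judge

    history = values[-(window + 1):-1]  # the `window` values before the newest
    latest = values[-1]

    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all counts as a deviation.
        return latest != history[0]

    # Simple z-score test against the trailing window.
    return abs(latest - mean(history)) / spread > threshold
```

A check like this could be run periodically against the series behind the graph, alarming only after several consecutive positive results to keep it from being as flaky as the old alarms.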
Hi Andrew,
On Mon, Mar 09, 2015 at 11:54:56AM -0400, Andrew Otto wrote:
> > https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Load
> Christian, may I move this page into the Cluster/Hadoop/Administration page?
I think a separate page is worth it as the target audience is different from the Cluster/Hadoop/Administration page.
But sure. Be Bold. Move it wherever you see fit. :-)
Have fun, Christian
I just want to make sure it can be found. I see you added it to the ToC at https://wikitech.wikimedia.org/wiki/Analytics/Cluster, so I think it’ll be fine.
This is really useful, Christian. Thanks for explaining and documenting it.
Leila
Thanks a lot Christian :) I did not mean by any means to overload the cluster last Friday ... I did it nonetheless. Your page on how to 'keep an eye on it' will be really useful!
Cheers
Joseph