Phew, ok, things did go wrong! We ran into a couple of bugs recently introduced in YARN and in Hive, and it took us a while to find workarounds. Jobs are flowing through the cluster again. However, they are lagging behind, since they haven’t been able to run all day. They should eventually catch up. For now, the cluster is back open for business, but I’d appreciate it if no one ran any heavy jobs until tomorrow.
Also, it is still possible that we will run into other issues we haven’t yet seen, so I can’t guarantee that I won’t have to restart things again.
Anyway, aside from those hiccups, CDH 5.4.0 is now installed, and Hive 1.1 and Spark 1.3.0 are now available, weeeeee!
-Ao
On May 4, 2015, at 11:05, Andrew Otto aotto@wikimedia.org wrote:
Hi all, as a reminder, I will be doing this upgrade today. Within the next hour I will turn off the Hadoop cluster. Please do not attempt to use it again until I notify you again.
Thanks! -AO
On Apr 29, 2015, at 14:57, Robert West west@cs.stanford.edu wrote:
All good!
On Wed, Apr 29, 2015 at 11:35 AM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
- the right research list (Andrew, remove wmfresearch@ from your contact
list :P )
All looks good to me. Thanks. :)
On Wed, Apr 29, 2015 at 1:11 PM, Leila Zia leila@wikimedia.org wrote:
FYI
Ashwin, Bob, Ellery, I don't anticipate this having a negative impact on our workflow. If you see possible issues, please communicate with Andrew (cc-ing me), or let me know and I will communicate them. Thanks!
---------- Forwarded message ----------
From: Andrew Otto aotto@wikimedia.org
Date: Wed, Apr 29, 2015 at 11:05 AM
Subject: [wmfresearch] Hadoop Cluster Downtime
To: Operations Engineers ops@lists.wikimedia.org, "A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics." analytics@lists.wikimedia.org, "wmfresearch@lists.wikimedia.org Research" wmfresearch@lists.wikimedia.org
Hi all!
CDH 5.4 is out[1] and we’d like to upgrade. We are doing this now, rather than later, because an important Parquet/Hive-related bug has been fixed in this version[2]. This upgrade will include Spark 1.3, which should make at least one researcher happy.
To do this upgrade, I need to schedule some downtime for Hadoop. I’d like to do this on Monday May 4th. I expect the upgrade to take me no more than an hour or two, but just to be safe I’d like to schedule the downtime for the whole day.
If anyone has critical things that they absolutely have to run on Monday, let me know now and I will find another day.
Thanks! -Ao
[1] http://blog.cloudera.com/blog/2015/04/cloudera-enterprise-5-4-is-released/
[2] https://issues.apache.org/jira/browse/HIVE-9482
wmfresearch mailing list wmfresearch@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wmfresearch
Research-Internal mailing list Research-Internal@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/research-internal
-- Up for a little language game? -- http://www.unfun.me
Ops mailing list Ops@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/ops
Hi all,
I’m continuing this thread just because it involves a little more Hadoop downtime.
Ops will be replacing a switch tomorrow. This was previously announced in reference to eventlog1001, but the switch replacement will affect Hadoop as well. During the migration, the ResourceManager will not be reachable, which means that running jobs are likely to die. The replacement is scheduled to take about 15 minutes, starting at 13:00 UTC tomorrow. Joseph and I will monitor the status of things and restart jobs as necessary.
Today I worked on getting a High Availability ResourceManager in place (we seem to need to restart that thing much more often these days), but I won’t be able to have it installed by tomorrow. I foresee a few more cluster restarts in the next week or so in order to apply this and some other changes. These restarts won’t result in long-term cluster downtime like Monday’s upgrade did, but they might disrupt some jobs. I will announce any restarts here.
Thanks! -Ao
On May 4, 2015, at 17:42, Andrew Otto aotto@wikimedia.org wrote: