Hi,
just a quick heads-up that the Analytics cluster got stuck today: jobs deadlocked while waiting for other jobs to free resources.
For the time being, to allow the cluster to catch up on the missed hours, I have suspended the refining jobs.
This gives the cluster enough resources to catch up on importing the Kafka data that it missed during the day.
However, this also means that the datasets pagecounts-all-sites, pagecounts-raw, and legacy_tsvs will fall behind a bit, and wmf.webrequest will not receive new data while the cluster is catching up.
Tomorrow in the European morning, once the cluster has caught up, I'll re-enable refining, and the datasets should then catch up as well.
Sorry for the inconvenience,
Christian
P.S.: Suspending refining may look a bit drastic. But if we only killed the resource-hungry jobs without stopping refining, refining would start while Camus is still catching up and would produce faulty datasets. Hence, we suspended refining for now. Tomorrow, we'll resume the suspended jobs and let the datasets catch up.
P.P.S.: If you have resource-hungry jobs to run on the Analytics cluster, please wait until tomorrow to run them if possible.