Toolforge just now suffered a partial grid-engine outage. All grid services should be back to normal as of this email; some k8s services may misbehave for the next hour or two.
NFS misbehavior caused grid control mechanisms to time out, which meant that no new jobs could be scheduled for the last 90 minutes or so. We've rebooted the NFS server, which has resolved the primary issues; however, rebooting NFS is itself disruptive and may have caused other jobs (both on the grid and in k8s) to fail.
We're currently rebooting all k8s worker nodes, which will take a couple of hours to complete. During those reboots some jobs may fail or be rescheduled unexpectedly.
Sorry for the outage! If your grid job was disrupted by this outage, please take this as a sign to migrate your service off the grid! Details about the grid shutdown can be found here: https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#T...
-Andrew (+ Taavi who did most of the actual recovery work)
cloud-announce@lists.wikimedia.org