- "Whaddya say we try that again, huh?" - "Yes, yes. Yes. Without the oops."
So, now that I have power and internet again, a reschedule for tomorrow (Thursday) at the same time:
=== Planned outage ===
When: Thursday April 25 at 18:00 UTC
Duration: 1 hour
Impact:
* Jobs running on the grid engine will be stopped, and execution nodes will be temporarily disabled;
* The login server will be restarted during the window, ending active sessions;
* The web service will be unavailable during the maintenance window; and
* Running processes not scheduled through the grid engine will be killed.
Recovery plan:
In case of unplanned failure during the maintenance window, configuration will be rolled back to the current version (that is, the gluster-based project storage will remain in place) and a new window will be planned after postmortem.
-- Marc
Hello again,
The maintenance has concluded successfully within the designated window, and the Tool Labs instances now use the new NFS server for shared filesystems.
This doubled as a hard test of the continuous bot start/restart system, since the entire cluster was disabled for rolling periods during the maintenance, and the filesystem on which the actual tools were running was switched underneath them -- pretty much a worst-case scenario to recover from.
The result is that all but one of the tools that had been started as continuous processes restarted cleanly and automatically as the cluster returned to function with the new filesystem (the one tool that did not restart failed for an unrelated reason).
Thank you all for your patience!
-- Marc
toolserver-l@lists.wikimedia.org