This went pretty well. I had to reboot three VMs:

wcqs-beta-01.wikidata-query.eqiad1.wikimedia.cloud
maps-wmanew.maps.eqiad1.wikimedia.cloud
tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud


That last one probably caused a few grid jobs to be restarted.


Please let me know if you encounter any bad behavior with this new NFS mount; it's a test case for future NFS migrations so I'm very interested in how well this one works.


-Andrew


On 1/18/22 8:59 AM, Andrew Bogott wrote:
Since no one expressed concerns about this, I'm going to go ahead and roll this out tomorrow morning at 16:00 UTC.  Here's what to expect:

1) If your VM mounts secondary-scratch but doesn't actually use it, nothing much will happen.
2) If your VM or tool has an open file on that volume when the switchover happens, it will probably freeze up (a quick way to check for open files is sketched below). I will reboot VMs that this happens to.
3) If you had files on the scratch volume before this change, they will be gone after the change. Precious files will be recoverable after the fact for a few weeks.
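
For anyone who wants to check ahead of time whether anything on their VM is holding scratch open, here's a rough sketch. This is just my own illustration rather than any official tooling; it assumes the standard /mnt/nfs/secondary-scratch mount point and needs root to see other users' processes:

    #!/usr/bin/env python3
    # Illustrative only: list processes with open file descriptors under the
    # scratch mount by walking /proc/<pid>/fd. Run as root to see every user's
    # processes; an unprivileged run only sees your own.
    import os

    SCRATCH = "/mnt/nfs/secondary-scratch"  # adjust if your mount point differs

    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target == SCRATCH or target.startswith(SCRATCH + "/"):
                print(f"pid {pid} has {target} open")
                break

Stopping whatever it reports before the switchover should mean your VM won't need a reboot.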

-Andrew


On 1/14/22 2:06 PM, Andrew Bogott wrote:
Hello, all!

We are in the process of re-engineering and virtualizing[0] the NFS service provided to Toolforge and VMs. The transition will be rocky and involve some service interruption... I'm still running tests to determine exactly how much disruption will be required.

The first volume that I'd like to replace is 'scratch,' typically mounted as /mnt/nfs/secondary-scratch. I'm seeking feedback about how vital scratch uptime is to your current workflow, and how disruptive it would be to lose data there.
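
If you're not sure whether a given VM mounts scratch at all, here's one quick way to check from the VM itself. Again, this is just my own illustration, assuming the usual mount point:

    #!/usr/bin/env python3
    # Rough check: is the scratch volume mounted on this VM, and from where?
    # Adjust the path if your mount point differs.
    with open("/proc/mounts") as mounts:
        for line in mounts:
            source, mountpoint, fstype = line.split()[:3]
            if mountpoint == "/mnt/nfs/secondary-scratch":
                print(f"mounted from {source} ({fstype})")
                break
        else:
            print("secondary-scratch is not mounted here")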

If you have a project or tool that uses scratch, please respond with your thoughts! My preference would be to wipe out all existing data on scratch and subject users to several unannounced periods of downtime, but I also don't want anyone to suffer. If you have important or persistent data on that volume, the WMCS team will work with you to migrate it somewhere safer, and if you have an important workflow that will break during scratch downtime, I'll work harder on devising a low-impact rollout.

Thank you!

-Andrew

[0] https://phabricator.wikimedia.org/T291405