This went pretty well. I had to reboot three VMs:
wcqs-beta-01.wikidata-query.eqiad1.wikimedia.cloud
maps-wmanew.maps.eqiad1.wikimedia.cloud
tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud
That last one probably caused a few grid jobs to be restarted.
Please let me know if you encounter any bad behavior with this new NFS
mount; it's a test case for future NFS migrations, so I'm very interested
in how well this one works.
-Andrew
On 1/18/22 8:59 AM, Andrew Bogott wrote:
Since no one expressed concerns about this, I'm
going to go ahead and
roll this out tomorrow morning at 16:00 UTC. Here's what to expect:
1) If your VM mounts secondary-scratch but doesn't actually use it,
nothing much will happen
2) If your VM or tool has a file open on that volume when the
switchover happens, it will probably freeze up. I will reboot any VMs
affected this way.
3) If you had files on the scratch volume before this change, they
will be gone after the change. Precious files will be recoverable
after the fact for a few weeks.
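If you want to check in advance whether case 2 applies to your VM, something like the following rough sketch should work (this assumes a Linux VM with the standard mountpoint and lsof utilities installed; the mount path is the one named in the announcement, adjust if yours differs):

```shell
# Hypothetical check, not an official WMCS script.
SCRATCH=/mnt/nfs/secondary-scratch

if mountpoint -q "$SCRATCH"; then
    echo "scratch is mounted"
    # List processes holding files open on the volume; no output here
    # means the switchover should not freeze this VM.
    lsof "$SCRATCH" 2>/dev/null || echo "no open files on $SCRATCH"
else
    echo "scratch is not mounted here"
fi
```

If the lsof call prints any processes, stopping them (or accepting a reboot) before 16:00 UTC would avoid a hung mount.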
-Andrew
On 1/14/22 2:06 PM, Andrew Bogott wrote:
Hello, all!
We are in the process of re-engineering and virtualizing[0] the NFS
service provided to Toolforge and VMs. The transition will be rocky
and involve some service interruption... I'm still running tests to
determine exactly how much disruption will be required.
The first volume that I'd like to replace is 'scratch,' typically
mounted as /mnt/nfs/secondary-scratch. I'm seeking feedback about how
vital scratch uptime is to your current workflow, and how disruptive
it would be to lose data there.
If you have a project or tool that uses scratch, please respond with
your thoughts! My preference would be to wipe out all existing data
on scratch and also subject users to several unannounced periods of
downtime, but I also don't want anyone to suffer. If you have
important/persistent data on that volume, the WMCS team will work
with you to migrate that data somewhere safer; and if you have an
important workflow that will break due to scratch downtime, I'll
work harder on devising a low-impact roll-out.
Thank you!
-Andrew
[0]
https://phabricator.wikimedia.org/T291405
_______________________________________________
Cloud-announce mailing list -- cloud-announce(a)lists.wikimedia.org
List information: