Unfortunately, the tools NFS server was affected, and some processes hung (see https://phabricator.wikimedia.org/T257945).
We are restarting some of the Toolforge nodes to get them unstuck; we'll have everything back in a fully working state soon.
On Mon, 2023-05-15 at 17:34 +0200, David Caro wrote:
Update
The restart was not as smooth as expected. After a minute of trying to shut the ports down to prepare for it, the storage cluster started having trouble, which caused some instability for the virtual machines (slow disk writes, network connectivity issues, etc.).
Everything should be back up now. We will gather details on the incident and investigate it to prevent it from happening again.
Let us know if you are still seeing issues by opening a ticket or pinging us on IRC:
https://wikitech.wikimedia.org/wiki/Help:Toolforge#Communication_and_support
Thanks for your patience!
On Mon, 2023-05-15 at 13:55 +0200, David Caro wrote:
Hi!
We are restarting a switch[1] today at 13:00 UTC.
We are moving all the affected VMs to different hypervisors and expect no downtime, though the servers might be briefly unresponsive (a couple of seconds) when the migration performs the final move of each VM.
We will reply to this email once it's done.
Thanks!
[1] https://phabricator.wikimedia.org/T316544
David Caro
SRE - Cloud Services
Wikimedia Foundation https://wikimediafoundation.org/
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the sum of all knowledge. That's our commitment."