Hi,
Today, 2019-09-30, we ran an operation across all CloudVPS virtual machines to update ferm, a firewalling utility, in order to fix a bug [0].
The fleet-wide operation resulted in ferm being installed on every VM, including VMs that did not require it. This caused a network outage for most of the virtual machines and projects that were not previously configured to use ferm. Many Toolforge tools (webservices, grid jobs, etc.) stopped working, database connections were lost, the proxy reported bad gateway errors, etc.
To resolve the issue, we quickly removed ferm from every VM and ran the puppet agent to reinstall it only on the VMs that had ferm in their puppet manifests. As soon as we did this, everything went back to normal. The incident lasted roughly one hour.
Please get in touch if you see any issues or have any questions about this incident.
Regards.
[0] https://phabricator.wikimedia.org/T153468
cloud-announce@lists.wikimedia.org