Hi,
Today, 2019-09-30, we ran an operation across all CloudVPS virtual machines to
update ferm and fix a bug [0]. Ferm is a firewalling utility.
The fleet-wide operation resulted in ferm being installed in every VM, even in
VMs that did not require it. This caused a network outage for most of the
virtual machines and projects that were not previously configured to use ferm.
Many Toolforge tools (webservices, grid jobs, etc.) stopped working, database
connections were lost, the proxy reported bad gateway errors, etc.
To resolve the issue, we quickly removed ferm from every VM and ran the puppet
agent to reinstall it only on the VMs that had ferm in their puppet manifests.
As soon as we did this, everything went back to normal.
The incident lasted roughly 1 hour.
Please get in contact if you see any issues or have any questions about this
incident.
Regards.
[0] https://phabricator.wikimedia.org/T153468
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation