There is an ongoing outage affecting all Cloud VPS projects (this includes Toolforge and PAWS) that prevents the machines from refreshing their IP addresses (the DHCP client got uninstalled).
We are working on it and the service should be restored soon. We will send an update once everything is up and running.
Tracking task: https://phabricator.wikimedia.org/T347665
Feel free to add a message there if your project is affected; we will verify that it is back online once we roll out the fix.
Thanks for your patience!
Most of the outage is already resolved; we are finishing up the recovery.
Cloud VPS projects should be mostly functional:
- Most instances have network access at this point. The remaining VMs without network access are currently being fixed.
- Most NFS shared storage servers have been rebooted. Some NFS clients may have gotten stuck as a result. We are looking for those, but if your project has a stuck client, rebooting it should unblock it.
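For anyone unsure whether an NFS client is stuck, one common way to check is to probe the mounted path with a timeout, since a hung mount usually blocks file operations indefinitely. The sketch below is only an illustration and not an official Cloud VPS tool: the function name `path_responds`, the 5-second default, and the use of `stat` as the probe command are all assumptions, and the paths to check are whatever your project mounts.

```python
import subprocess
import sys
from typing import Sequence


def path_responds(path: str, timeout_s: float = 5.0,
                  probe: Sequence[str] = ("stat", "--")) -> bool:
    """Return True if the probe command finishes within timeout_s seconds.

    The probe runs in a child process so that a hung mount blocks the
    child rather than this script. The exit status is ignored: a quick
    "no such file" answer still means the filesystem is responding.
    """
    proc = subprocess.Popen(
        [*probe, path],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Deliberately do not wait again after kill(): a process blocked
        # on a hung NFS mount may not exit until the mount recovers.
        proc.kill()
        return False
    return True


if __name__ == "__main__":
    # Usage (hypothetical): python check_mounts.py <path> [<path> ...]
    for p in sys.argv[1:]:
        state = "responding" if path_responds(p) else "no response, possibly stuck"
        print(f"{p}: {state}")
```

If a path reports no response, that is consistent with the stuck-client case described above, where a reboot should unblock it. A slow but healthy server could also exceed the timeout, so treat the result as a hint rather than a diagnosis.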
Toolforge is mostly functional:
- NFS had a hiccup and we had to reboot all the worker nodes. The Kubernetes side should be fully functional, but the grid is still restarting web services; we estimate this will take less than an hour from this point.
PAWS should be fully functional. Superset should be fully functional.
We will send another update as soon as everything is fixed (or in a few hours if there are still issues).
Thanks!
On Fri, Sep 29, 2023 at 9:31 AM David Caro dcaro@wikimedia.org wrote:
[...]
Good news!
All the services should be back up and running!
If you still have issues, please ping us on IRC or open a Phabricator task.
Thanks again for your patience!
On Fri, Sep 29, 2023 at 1:11 PM David Caro dcaro@wikimedia.org wrote:
[...]
cloud-announce@lists.wikimedia.org