There will be two major Toolforge outages this coming week. Each outage will cause tool downtime and may require manual restarts afterwards.
The first outage is an NFS migration [0] and will take place on Monday, beginning at around 0:00 UTC and lasting well into the day, possibly as late as 19:00 UTC. During this long period, Toolforge NFS will be read-only. This will cause most tools (for example, anything that writes a log file) to fail.
The second outage will be a database migration [1] and will take place on Thursday at 17:00 UTC. During this window ToolsDB will be read-only. This migration should take about an hour, but unexpected side effects may extend the downtime.
We try very hard to avoid outages of this magnitude, but at this point we need to choose downtime over the increasing risk of data loss.
More details can be found below.
[0] NFS Outage and system reboots Monday: The existing Toolforge NFS server is running on aging hardware and lacks a straightforward path for maintenance or upgrading. To improve this, we are moving NFS to a Cinder+VM platform, which should support easier upgrades, migrations, and expansions in the future. To maintain data integrity during the migration, the old server will need to be made read-only while the last set of file changes is synchronized with the new server. Because the NFS service is actively used, the final sync will take many hours to complete.
To ensure stable mounts of the new server, every node in Toolforge will be rebooted as part of this migration. That means that even tools which do not use NFS will be affected, although most tools should restart gracefully.
This task is documented as https://phabricator.wikimedia.org/T333477.
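For maintainers who would like a tool to survive a period in which its log file cannot be written, one option is to fall back to stderr. The following is a minimal Python sketch under stated assumptions, not an official Toolforge recommendation: `build_logger` and the log path are hypothetical names chosen for illustration, and writability is only checked once, when the logger is created.

```python
import logging
import sys


def build_logger(name: str, log_path: str) -> logging.Logger:
    """Return a logger writing to log_path, or to stderr if it cannot be opened.

    Hypothetical helper for illustration; not part of any Toolforge API.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    try:
        # FileHandler opens the file immediately, so a read-only or missing
        # location raises OSError here rather than at the first log call.
        handler: logging.Handler = logging.FileHandler(log_path)
    except OSError:
        handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

A tool would call `build_logger("mytool", "/path/to/mytool.log")` at startup (both arguments are placeholders). This does not help a tool that is already running when the file system becomes read-only, and it does not avoid the node reboots described above.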
[1] DB outage Thursday: As part of the ongoing effort to upgrade user-created Toolforge databases, we will migrate ToolsDB to a new VM that will have a more recent version of Debian and MariaDB and will use a more resilient storage solution.
The new VM is ready, and we plan to point all tools to use it on *Apr 6, 2023 at 17:00 UTC*.
This will involve about *1 hour of read-only time* for the database. Any existing database connection will be terminated, and if your tool does not reconnect automatically you might have to restart it manually.
An email will be sent shortly before the migration starts, and again when it has finished.
Please also make sure your tool is connecting to the database using the canonical hostname *tools.db.svc.wikimedia.cloud* and not any other hostname or IP address.
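For tools that do not yet reconnect on their own, the advice above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `run_with_reconnect`, `connect`, and `operation` are hypothetical names, and `retry_on` must be set to whatever exceptions your database driver actually raises when a connection is dropped. The only value taken from this announcement is the canonical hostname.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Canonical ToolsDB hostname, as given in the announcement.
TOOLSDB_HOST = "tools.db.svc.wikimedia.cloud"


def run_with_reconnect(
    connect: Callable[[str], object],
    operation: Callable[[object], T],
    retries: int = 5,
    delay_seconds: float = 1.0,
    retry_on: tuple = (ConnectionError, OSError),
) -> T:
    """Run operation(conn), opening a fresh connection after each failure.

    connect and operation are hypothetical callables supplied by the tool.
    """
    last_error = None
    for attempt in range(retries):
        try:
            # Always connect via the canonical hostname, never an IP address.
            conn = connect(TOOLSDB_HOST)
            return operation(conn)
        except retry_on as exc:
            last_error = exc
            if attempt < retries - 1:
                # Exponential backoff between attempts.
                time.sleep(delay_seconds * (2 ** attempt))
    raise RuntimeError(f"gave up after {retries} attempts") from last_error
```

With a driver such as PyMySQL, `connect` might wrap `pymysql.connect(host=host, ...)` with the tool's own credentials. Note that retrying only covers dropped connections; writes attempted while the database is read-only will still fail until the migration is finished.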
For more details, and to report any issue, you can read or leave a comment at https://phabricator.wikimedia.org/T333471
For more context you can also check out the parent task https://phabricator.wikimedia.org/T301949
_______________________________________________ Cloud-announce mailing list -- cloud-announce@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.o...
Reminder: The first of these outages will start in about 30 minutes. Toolforge NFS will be read-only for as long as 18-19 hours.
On 3/29/23 2:17 PM, Andrew Bogott wrote:
[...]
Hi,
Is there an easy way to stop the jsub "failed to redirect job output" error messages that I am receiving by email during the NFS outage, ideally for all of the ~50 jobs I have scheduled for my tools?
Unfortunately, I am currently receiving dozens of emails every hour.
Martin
On Mon, Apr 3, 2023, 1:24 AM Andrew Bogott abogott@wikimedia.org wrote:
[...]
_______________________________________________ Cloud mailing list -- cloud@lists.wikimedia.org List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/
Today's migration is complete, and the new server seems to be performing adequately. All volumes are now read/write, and most tools should be running again.
Taavi is in the process of restarting most Kubernetes worker nodes, which will take another 60 minutes or so. If your tools and jobs are still not running properly after an hour or two, a simple restart should get things back to normal.
Thank you for your patience with this long outage. And remember, a database outage is coming up later in the week!
-Andrew
On 4/2/23 6:24 PM, Andrew Bogott wrote:
[...]