Reminder: The first of these outages
will start in about 30 minutes. Toolforge NFS will be read-only
for up to 19 hours.
There will be two major Toolforge outages this coming week.
Each outage will cause tool downtime and may require manual
restarts afterwards.
The first outage is an NFS migration [0] and will take place on
Monday, beginning at around 00:00 UTC and lasting well into
the day, possibly as late as 19:00 UTC. During this long
period, Toolforge NFS will be read-only. This will cause most
tools (for example, anything that writes a log file) to fail.
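For tool maintainers who would rather degrade than crash while the log
directory is not writable, here is a minimal Python sketch. It assumes
the tool uses the standard logging module and opens its log file at
startup; build_logger and its arguments are hypothetical names, not
part of any Toolforge API. Because every node will also be rebooted,
this does not prevent downtime.

```python
import logging
import sys


def build_logger(name, log_path):
    """Return a logger that writes to log_path, or to stderr if the
    file cannot be opened (for example on a read-only filesystem)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    try:
        # FileHandler opens the file immediately, so a read-only or
        # missing directory raises OSError here rather than later.
        handler = logging.FileHandler(log_path)
    except OSError:
        handler = logging.StreamHandler(sys.stderr)
    logger.addHandler(handler)
    return logger
```

A tool would call build_logger("mytool", "/path/to/mytool.log") once at
startup, where the path is whatever the tool already uses.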
The second outage will be a database migration [1] and will take
place on Thursday at 17:00 UTC. During this window ToolsDB will be
read-only. This migration should take about an hour, but
unexpected side effects may extend the downtime.
We try
very hard to avoid outages of this magnitude, but at this
point we need to choose downtime over the increasing risk of
data loss.
More
details can be found below.
[0] NFS outage and system reboots Monday: The existing Toolforge NFS
server is running on aging hardware and lacks a
straightforward path for maintenance or upgrading. To
improve this, we are moving NFS to a Cinder+VM platform, which
should support easier upgrades, migrations, and expansions
in the future. In order to maintain data integrity during
the migration, the old server will need to be made read-only
while the last set of file changes is synchronized with the
new server. Because the NFS service is actively used, it
will take many hours to complete the final sync.
To
ensure stable mounts of the new server, every node in Toolforge
will be rebooted as part of this migration. That means that
even tools which do not use NFS will be affected, although
most tools should restart gracefully.
[1] DB outage Thursday: As part of the ongoing effort to
upgrade user-created Toolforge databases, we will
migrate ToolsDB to a new VM that will have a more recent
version of Debian and MariaDB and will use a more resilient
storage solution.
The new VM is ready, and we plan to point all tools to it on
Apr 6, 2023 at 17:00 UTC.
This
will involve about 1 hour of read-only time
for the database. Any existing database connection will be
terminated, and if your tool does not reconnect
automatically you might have to restart it manually.
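If your tool does not already reconnect, the idea can be sketched in
Python roughly as follows. This is an illustration under assumptions,
not a recommended implementation: run_with_reconnect, connect and
operation are hypothetical names, and the exception classes that
signal a dropped connection depend on your database driver, so pass
the appropriate ones as retriable.

```python
import time


def run_with_reconnect(connect, operation, attempts=5, delay=1.0,
                       retriable=(OSError,), sleep=time.sleep):
    """Call operation(conn), opening a fresh connection and retrying
    with exponential backoff whenever a retriable error is raised."""
    conn = None
    last_error = None
    for attempt in range(attempts):
        try:
            if conn is None:
                conn = connect()  # (re)open the connection
            return operation(conn)
        except retriable as exc:
            last_error = exc
            conn = None  # discard the dead connection
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))
    raise last_error
```

Here connect would be a function that opens a connection with your
driver, and operation a function that runs one query on it. Writes
attempted during the read-only window will still fail until the
migration is finished.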
An email will be sent shortly before the migration starts,
and again when it is finished.
Please
also make sure your tool is connecting to the database using
the canonical hostname tools.db.svc.wikimedia.cloud
and not any other hostname or IP address.
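One way to double-check this is to compare the host your tool is
configured with against the canonical name. The Python sketch below
assumes you can read the configured host as a string;
uses_canonical_host is a hypothetical helper name.

```python
CANONICAL_HOST = "tools.db.svc.wikimedia.cloud"


def uses_canonical_host(configured_host):
    """Return True if configured_host is the canonical ToolsDB name,
    ignoring case, surrounding whitespace and a trailing dot."""
    normalized = configured_host.strip().lower().rstrip(".")
    return normalized == CANONICAL_HOST
```

Any other value, including an IP address, makes the function return
False and is worth updating before the migration.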