- Cloud-announce - lists.wikimedia.org

Switch restart today at 13:00 UTC, no downtime expected

by David Caro

Hi!

We are restarting a switch [1] today at 13:00 UTC. We are moving all the affected VMs to different hypervisors and expect no downtime, though a server may be briefly unresponsive (for a couple of seconds) at the moment its VM is migrated. We will reply to this email once it's done.

Thanks!

[1] https://phabricator.wikimedia.org/T316544

--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the sum of all knowledge. That's our commitment."

11 months, 2 weeks

PAWS rebuild

by Vivian Rook

PAWS was down and has been rebuilt. If you had a logged-in session, you may have to refresh your state by logging out and back in to restart your server.

Thank you!

--
*Vivian Rook (They/Them)*
Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>

11 months, 3 weeks

PAWS to enforce storage limits 2023-05-15

by Vivian Rook

On 2023-05-15 PAWS will begin enforcing storage limits.

https://phabricator.wikimedia.org/T327936

For home directories whose usage exceeds 1 gigabyte, the largest files will be removed to bring the total usage below 1 gigabyte.

--
*Vivian Rook (They/Them)*
Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
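To see ahead of time which files might be affected, something like the following sketch can help. It is not the script PAWS uses; it only mirrors the rule as stated in the announcement (drop the largest files first until usage is below the limit). The function name `files_over_limit` and the reading of "1 gigabyte" as 10**9 bytes are assumptions made for illustration.

```python
import os
from pathlib import Path

# Assumption: "1 gigabyte" is read as 10**9 bytes; the real limit may be
# measured differently (for example in GiB or in disk blocks).
LIMIT_BYTES = 1_000_000_000


def files_over_limit(home, limit=LIMIT_BYTES):
    """List the largest files whose removal would bring usage below limit.

    Hypothetical helper: it only reports candidates and deletes nothing.
    """
    sizes = []
    for root, _dirs, names in os.walk(home):
        for name in names:
            path = Path(root) / name
            try:
                if path.is_symlink():
                    continue  # skip symlinks so targets are not counted twice
                sizes.append((path.stat().st_size, path))
            except OSError:
                continue  # file vanished or is unreadable; ignore it

    total = sum(size for size, _ in sizes)
    candidates = []
    # Walk from the largest file down until the total is below the limit.
    for size, path in sorted(sizes, key=lambda item: item[0], reverse=True):
        if total < limit:
            break
        candidates.append(path)
        total -= size
    return candidates


if __name__ == "__main__":
    for path in files_over_limit(Path.home()):
        print(path)
```

Running it inside a PAWS terminal prints the files you may want to move elsewhere or delete yourself before the enforcement date.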

11 months, 4 weeks

Toolforge Kubernetes upgrade on 2023-04-03

by Taavi Väänänen

Hi, We will be upgrading the Toolforge Kubernetes cluster next Monday (2023-04-03) starting at around 10:00 UTC. The expected impact is that tools running on the Kubernetes cluster will get restarted a couple of times over the course of the few hours it takes for us to upgrade the entire cluster. The ability to manage tools will remain operational. Since the version we're upgrading to (1.22) removes a bunch of deprecated Kubernetes APIs, tools that use kubectl and raw Kubernetes resources directly may want to check that they're on the latest available versions. The vast majority of tools that are only using the Jobs framework and/or the webservice command are not affected by these changes. Taavi
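For tools that keep raw Kubernetes manifests, a rough check along these lines may help spot resources that still use API versions removed in 1.22. This is only a sketch: the table below is a partial list taken from the upstream Kubernetes deprecation guide, the name `find_removed_apis` is made up for illustration, and the scan is a simple line match rather than a real YAML parser.

```python
import re

# Partial list of API versions that Kubernetes 1.22 no longer serves.
# Assumption: this subset covers the kinds most tools are likely to use;
# consult the upstream deprecation guide for the full list.
REMOVED_IN_1_22 = {
    "extensions/v1beta1": {"Ingress"},
    "networking.k8s.io/v1beta1": {"Ingress", "IngressClass"},
    "apiextensions.k8s.io/v1beta1": {"CustomResourceDefinition"},
    "rbac.authorization.k8s.io/v1beta1": {
        "Role", "RoleBinding", "ClusterRole", "ClusterRoleBinding",
    },
    "admissionregistration.k8s.io/v1beta1": {
        "MutatingWebhookConfiguration", "ValidatingWebhookConfiguration",
    },
}


def find_removed_apis(manifest_text):
    """Return (apiVersion, kind) pairs in the manifest that 1.22 removed."""
    findings = []
    # Multi-document YAML files separate documents with a line of "---".
    for doc in re.split(r"^---\s*$", manifest_text, flags=re.MULTILINE):
        api = re.search(r"^apiVersion:\s*(\S+)", doc, flags=re.MULTILINE)
        kind = re.search(r"^kind:\s*(\S+)", doc, flags=re.MULTILINE)
        if not (api and kind):
            continue
        api_version = api.group(1).strip("\"'")
        kind_name = kind.group(1).strip("\"'")
        if kind_name in REMOVED_IN_1_22.get(api_version, ()):
            findings.append((api_version, kind_name))
    return findings


if __name__ == "__main__":
    import sys

    for file_name in sys.argv[1:]:
        with open(file_name, encoding="utf-8") as handle:
            for api_version, kind_name in find_removed_apis(handle.read()):
                print(f"{file_name}: {kind_name} uses removed {api_version}")
```

Pass one or more manifest files as arguments; no output means none of the listed API versions were found.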

1 year

(Another) toolforge outage coming Thursday

by Andrew Bogott

On Thursday we will be migrating most Toolforge databases to a new server. This will take place on Thursday at 17:00 UTC. During this window ToolsDB will be read-only and most tools that rely on writing to the database will fail. This migration should take about an hour, but unexpected side effects may extend the downtime.

-- details --

DB outage Thursday:

As part of the ongoing effort to upgrade user-created Toolforge databases, we will migrate ToolsDB to a new VM that will have a more recent version of Debian and MariaDB and will use a more resilient storage solution.

The new VM is ready, and we plan to point all tools to use it on *Apr 6, 2023 at 17:00 UTC*. This will involve about *1 hour of read-only time* for the database. Any existing database connection will be terminated, and if your tool does not reconnect automatically you might have to restart it manually. An email will be sent shortly before starting the migration, and when it's finished.

Please also make sure your tool is connecting to the database using the canonical hostname *tools.db.svc.wikimedia.cloud* and not any other hostname or IP address.

For more details, and to report any issue, you can read or leave a comment at https://phabricator.wikimedia.org/T333471

For more context you can also check out the parent task https://phabricator.wikimedia.org/T301949
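For tools that do not yet reconnect on their own, a retry wrapper along these lines is one possible approach. It is a sketch under stated assumptions, not a Toolforge-provided API: `connect_with_retry` is a hypothetical helper, the `connect` callable stands in for whatever your database driver offers, and the `retry_on` exception types must be set to whatever that driver raises on connection failure. Only the hostname comes from the announcement.

```python
import time

# Canonical hostname given in the announcement.
TOOLSDB_HOST = "tools.db.svc.wikimedia.cloud"


def connect_with_retry(connect, attempts=5, base_delay=1.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call connect(host) until it succeeds, backing off between tries.

    connect:  callable taking a hostname and returning a connection
              (assumption: a thin wrapper around your driver's connect).
    retry_on: exception types treated as "connection failed, try again"
              (assumption: adjust to the errors your driver raises).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect(TOOLSDB_HOST)
        except retry_on as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, then 2x, 4x, and so on.
                sleep(base_delay * 2 ** attempt)
    raise ConnectionError(
        f"could not connect to {TOOLSDB_HOST} after {attempts} attempts"
    ) from last_error
```

A tool would call this wherever it currently opens its connection, and again whenever a query fails because the connection was dropped. Passing the hostname through one constant also makes it easy to confirm the tool is not using an older hostname or a hard-coded IP address.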

1 year

PAWS new cluster

by Vivian Rook

The PAWS NFS backing was acting up. While repairing it, the existing PAWS cluster locked up. A new cluster was deployed to replace it; anything you were running in PAWS will need to be restarted.

https://phabricator.wikimedia.org/T334140

Thank you!

--
*Vivian Rook (They/Them)*
Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>

1 year

Two toolforge outages coming next week, Monday and Thursday

by Andrew Bogott

There will be two major Toolforge outages this coming week. Each outage will cause tool downtime and may require manual restarts afterwards.

The first outage is an NFS migration [0] and will take place on Monday, beginning at around 0:00 UTC and lasting well into the day, possibly as late as 19:00 UTC. During this long period, Toolforge NFS will be read-only. This will cause most tools (for example, anything that writes a log file) to fail.

The second outage will be a database migration [1] and will take place on Thursday at 17:00 UTC. During this window ToolsDB will be read-only. This migration should take about an hour, but unexpected side effects may extend the downtime.

We try very hard to avoid outages of this magnitude, but at this point we need to choose downtime over the increasing risk of data loss. More details can be found below.

[0] NFS outage and system reboots Monday:

The existing Toolforge NFS server is running on aging hardware and lacks a straightforward path for maintenance or upgrading. To improve this we are moving NFS to a cinder+VM platform, which should support easier upgrades, migrations, and expansions in the future.

In order to maintain data integrity during the migration, the old server will need to be made read-only while the last set of file changes is synchronized with the new server. Because the NFS service is actively used, it will take many hours to complete the final sync.

To ensure stable mounts of the new server, every node in Toolforge will be rebooted as part of this migration. That means that even tools which do not use NFS will be affected, although most tools should restart gracefully.

This task is documented as https://phabricator.wikimedia.org/T333477.

[1] DB outage Thursday:

As part of the ongoing effort to upgrade user-created Toolforge databases, we will migrate ToolsDB to a new VM that will have a more recent version of Debian and MariaDB and will use a more resilient storage solution.

The new VM is ready, and we plan to point all tools to use it on *Apr 6, 2023 at 17:00 UTC*. This will involve about *1 hour of read-only time* for the database. Any existing database connection will be terminated, and if your tool does not reconnect automatically you might have to restart it manually. An email will be sent shortly before starting the migration, and when it's finished.

Please also make sure your tool is connecting to the database using the canonical hostname *tools.db.svc.wikimedia.cloud* and not any other hostname or IP address.

For more details, and to report any issue, you can read or leave a comment at https://phabricator.wikimedia.org/T333471

For more context you can also check out the parent task https://phabricator.wikimedia.org/T301949

1 year

Re: [Cloud] Re: Toolforge Kubernetes upgrade on 2023-04-03 (new date: 2023-04-10)

by Andrew Bogott

On 3/30/23 8:24 AM, Roy Smith wrote:
> Just to make sure I'm clear, the downtime announced yesterday
> <https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.…> is
> still happening?

That's correct, the upcoming downtimes are still happening. These three projects are largely unrelated so we're trying to not do them all at the same time.

>
>> On Mar 30, 2023, at 6:42 AM, Arturo Borrero Gonzalez
>> <aborrero(a)wikimedia.org> wrote:
>>
>> On 3/28/23 00:13, Taavi Väänänen wrote:
>>> Hi,
>>> We will be upgrading the Toolforge Kubernetes cluster next Monday
>>> (2023-04-03) starting at around 10:00 UTC.
>>> The expected impact is that tools running on the Kubernetes cluster
>>> will get restarted a couple of times over the course of the few
>>> hours it takes for us to upgrade the entire cluster. The ability to
>>> manage tools will remain operational.
>>> Since the version we're upgrading to (1.22) removes a bunch of
>>> deprecated Kubernetes APIs, tools that use kubectl and raw
>>> Kubernetes resources directly may want to check that they're on the
>>> latest available versions. The vast majority of tools that are only
>>> using the Jobs framework and/or the webservice command are not
>>> affected by these changes.
>>
>> This has been rescheduled to Monday 2023-04-10 to leave room for the
>> other operations we have.
>>
>> regards.
>>
>> --
>> Arturo Borrero Gonzalez
>> Senior SRE / Wikimedia Cloud Services
>> Wikimedia Foundation
>> _______________________________________________
>> Cloud-announce mailing list -- cloud-announce(a)lists.wikimedia.org
>> List information:
>> https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.…
>
>
> _______________________________________________
> Cloud mailing list -- cloud(a)lists.wikimedia.org
> List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimed…

1 year

Partial WMCS outage tomorrow, 2023-03-28 between 14:00 and 16:00 UTC

by Andrew Bogott

Due to unavoidable network switch maintenance [0], some WMCS services will be offline briefly tomorrow. The downtime will last for 20-30 minutes and take place sometime between 14:00 and 16:00 UTC.

Here is what to expect during the downtime:

* *ToolsDB will be unavailable and all queries will fail*
* Some of the wiki replica databases may be unavailable
* Some DNS servers will be offline; some services may fail to resolve hosts, depending on their fallback logic

We anticipate a graceful recovery from this outage, but NFS is fickle so we may need to reboot some or all VMs after the outage.

Sorry in advance for any inconvenience or upset emails that result from this maintenance.

- Andrew + the WMCS team

[0] https://phabricator.wikimedia.org/T330165

1 year, 1 month

PAWS k8s upgrade 2023-03-20

by Vivian Rook

PAWS will be switching k8s clusters to get to the latest k8s version that OpenStack currently supports (1.23). This should occur on 2023-03-20 around 13:00 UTC. Anything running on the current (old) cluster at that time will need to be restarted.

https://phabricator.wikimedia.org/T328489

--
*Vivian Rook (They/Them)*
Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>

1 year, 1 month
