July 2021 - Cloud-announce - lists.wikimedia.org

Rebooting login.toolforge.org in 10 minutes
by Brooke Storm 28 Jul '21

Since there seems to be some error with sssd (LDAP and name services daemon) on the main Toolforge bastion, I am going to reboot it at 21:33 UTC today. Sorry for the inconvenience.

Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org

2021-07-26@1530 UTC Toolforge Kubernetes Upgrade
by Brooke Storm 23 Jul '21

Tools admins will be upgrading Toolforge Kubernetes to version 1.19 on Monday, July 26th at 15:30 UTC to catch up with the upstream release cycle. This should be mostly invisible to end users, apart from the occasional pod restart.

Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org

2021-07-21@1500 UTC PAWS Kubernetes upgrade
by Brooke Storm 20 Jul '21

We will be upgrading PAWS Kubernetes tomorrow at 15:00 UTC. User impact should be minimal, but you might see your notebook server stop and restart at some point during the change. Calico (network overlay) may also be upgraded for both PAWS and Tools, but previous upgrades have had no visible user impact at all, so that should also be quiet and require no user action.

Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org

Database as a Service in Cloud VPS
by Andrew Bogott 19 Jul '21

A few weeks ago we rolled out a new service for Cloud VPS users: OpenStack Trove, aka 'Database as a Service.' Trove provides automatic orchestration of stand-alone database instances. In brief, you tell Trove to create a database server with a given size and backend, and it builds and manages the server and provides you with ready-made access links. You can also manage databases and users with Trove, or get a root prompt on the backend itself to create users and databases.

We have only tested this lightly, so I invite anyone with interest to give this a try and let us know what works and what doesn't.

There's a longer blog post about this feature here: https://techblog.wikimedia.org/2021/07/19/introducing-database-as-a-service…

And some slapdash user documentation here: https://wikitech.wikimedia.org/wiki/Help:Adding_a_Database_to_a_Cloud_VPS_P…

Bugs and doc-patches are always welcome!

-Andrew + the WMCS team

Possible network interruptions at 15:00 UTC July 20th, 22nd, 27th, 29th
by Andrew Bogott 19 Jul '21

Greetings!

Over the next two weeks our network staff will be adjusting and restarting the eqiad network switches. This will affect every server and service running on WMCS, both Toolforge and Cloud VPS. We don't expect this to result in noticeable downtime, but any connections that are active during the restarts will be interrupted. It's also always possible that some unexpected side effect will result in a prolonged network outage.

One switch will be restarted at 15:00 UTC on each of July 20th, 22nd, 27th, and 29th. The restart on the 27th is the most likely to affect cloud services.

To avoid worst-case scenarios, the WMCS team will be failing over several services before the restarts. Most of these changes won't be noticeable to users, but we'll send advance notice if we expect anything dramatic.

-Andrew

wiki replicas maintenance on 2021-07-22
by Arturo Borrero Gonzalez 19 Jul '21

Hi there,

On Thursday, July 22nd at 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST) there is a planned network maintenance that will affect the availability of the wiki replica database service. The operation window is expected to last about 5 minutes, and it will affect all wiki replica users, including Toolforge tools, PAWS, and any other Cloud VPS project using the replicas.

More information can be found on Phabricator: https://phabricator.wikimedia.org/T286614

Regards,
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation
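For tools that query the wiki replicas on a schedule, one way to ride out a short window like this is to retry failed connections with exponential backoff. The following is only a minimal sketch in Python: the helper name `retry_with_backoff` and all of its parameters are hypothetical, the exception types are an assumption (substitute whatever your database driver raises), and the default delays are illustrative and would need tuning to span a window of about 5 minutes.

```python
import random
import time


def retry_with_backoff(operation, attempts=6, base_delay=1.0, max_delay=60.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call operation() until it succeeds or attempts are exhausted.

    Hypothetical helper: `operation` is any zero-argument callable, for
    example a function that opens a connection and runs one query.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            # Give up and re-raise once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            # Double the wait each time, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add jitter so many tools do not reconnect at the same instant.
            sleep(delay + random.uniform(0, delay / 2))
```

A caller would wrap its own query function, for example `retry_with_backoff(run_my_query)`, where `run_my_query` is a placeholder for the tool's own code.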

2021-07-20@1500 UTC Maps and Scratch NFS briefly unavailable
by Brooke Storm 16 Jul '21

Network maintenance on Tuesday, July 20th at around 15:00 UTC will affect both nodes of the maps and scratch cluster (see https://phabricator.wikimedia.org/T286069). It should be extremely short in duration (measured in seconds, not minutes), so we will not be failing them over. WMCS will keep an eye on the impact to client VMs and will remediate problems where necessary. If all goes well, most services won’t notice.

Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org

Webservice release changes coming this week
by Brooke Storm 02 Jul '21

Hello cloud users,

We will be releasing webservice version 0.75 to Toolforge this week. Most of the changes should not be noticeable (upgrading to Python 3, preparing for future grid system upgrades). However, there is a change for people who have started to use the service.template feature (https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web#Webservice_templates).

This change (https://gerrit.wikimedia.org/r/c/operations/software/tools-webservice/+/636…) affects the way webservice templates are found by adding more locations. It will automatically check certain code directories for your template besides $HOME, but it will throw an error if your tool’s home directory contains a symlink to a service.template in one of these locations. The locations besides $HOME that this checks for a service.template are:

* ~/www/python/src
* ~/www/js
* ~/public_html

If you are affected by that, simply remove the symlink in $HOME, and it should work fine. If you are not symlinking a service.template back to your tool’s home directory, you should not notice any changes.

The full changelog is at https://gerrit.wikimedia.org/r/c/operations/software/tools-webservice/+/700…

When the release is complete, I’ll send a followup email and record it in SAL.

Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org
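For maintainers who want to check a tool ahead of the release, the symlink condition described above can be sketched in Python. This is not the logic webservice itself uses: the function name `conflicting_template_symlink` and the constant `EXTRA_TEMPLATE_DIRS` are hypothetical, and the sketch assumes the problem case is a `service.template` symlink in the home directory that resolves into one of the three listed locations.

```python
from pathlib import Path

# The extra locations named in the announcement, relative to the tool's home.
EXTRA_TEMPLATE_DIRS = ("www/python/src", "www/js", "public_html")


def conflicting_template_symlink(home):
    """Return the symlink target if ~/service.template points into one of
    the extra locations, otherwise None. Hypothetical helper for illustration.
    """
    home = Path(home)
    template = home / "service.template"
    # A regular file (or no file at all) in $HOME is not the problem case.
    if not template.is_symlink():
        return None
    target = template.resolve()
    for rel in EXTRA_TEMPLATE_DIRS:
        candidate_dir = (home / rel).resolve()
        # True when the target lives anywhere below the candidate directory.
        if candidate_dir in target.parents:
            return target
    return None


if __name__ == "__main__":
    hit = conflicting_template_symlink(Path.home())
    if hit is None:
        print("No conflicting service.template symlink found.")
    else:
        print(f"~/service.template is a symlink to {hit}; consider removing it.")
```

Run from the tool account, this only reports what it finds and never deletes anything; removing the symlink remains a manual step.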

2021-07-01 scratch and maps NFS maintenance
by Brooke Storm 01 Jul '21

The NFS servers used for scratch and maps mounts (/data/project and /home in the maps project and /data/scratch in other projects) will be going offline for a short time tomorrow 2021-07-01 at around 16:00 UTC to move the mounts to DRBD synced volumes. The current setup causes odd issues during failover, including data loss and stale files left behind. The process taking place is one of those failovers, so some previously deleted files may reappear and need deleting again, along with similar anomalies.

I plan to reboot the maps project servers to make sure they have their mounts and processes restored as well as possible. The impact on scratch mounts should be smaller. If you use scratch, just be aware that it will go offline for a bit and will be back with some possible quirks. After that, the data should become far more stable and properly synced between the two systems.

The process could start later than 16:00 UTC if there are sync issues initially, as I try to get as much of the data as possible transferred.

More details here: https://phabricator.wikimedia.org/T224747

Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org
