Since there seems to be some error with sssd (LDAP and name services daemon) on the main Toolforge bastion, I am going to reboot it at 21:33 UTC today.
Sorry for the inconvenience.
Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org
Tools admins will be upgrading Toolforge Kubernetes to version 1.19 on Monday, July 26th at 15:30 UTC to catch up to the upstream release cycle. This should be mostly invisible to end users, apart from the occasional pod restart.
Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org
We will be upgrading PAWS Kubernetes tomorrow at 15:00 UTC. User impact should be minimal, but you might see your notebook server stop and restart at some point during the change. Calico (network overlay) may also be upgraded for both PAWS and Tools, but previous upgrades have had no visible user impact at all, so that should also be quiet and require no user action.
Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org
A few weeks ago we rolled out a new service for Cloud VPS users:
OpenStack Trove, aka 'Database as a Service.'
Trove provides automatic orchestration of stand-alone database
instances. In brief, you tell Trove to create a database server with a
given size and backend, and it builds and manages the server and
provides you with ready-made access links. You can also manage databases
and users with Trove, or get a root prompt on the backend itself to
create users and databases.
We have only tested this a little bit, so I invite anyone with interest
to give this a try and let us know what works and what doesn't.
There's a longer blog post about this feature here:
https://techblog.wikimedia.org/2021/07/19/introducing-database-as-a-service…
And some slapdash user documentation here:
https://wikitech.wikimedia.org/wiki/Help:Adding_a_Database_to_a_Cloud_VPS_P…
Bugs and doc-patches are always welcome!
-Andrew + the WMCS team
Greetings!
Over the next two weeks our network staff will be adjusting and
restarting the eqiad network switches. This will affect every server and
service running on WMCS, both Toolforge and Cloud VPS.
We don't expect this to result in noticeable downtime, but any
connections that are active during the restarts will be interrupted.
It's also always possible that some unexpected side-effect will result
in a prolonged network outage.
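
For tool maintainers who want long-running jobs to ride out a brief interruption like the one described above, one common approach is to retry failed network operations with a growing delay. The following is only a minimal sketch, not anything WMCS provides: the helper name `with_retries`, its parameters, and the choice of exceptions to catch are all assumptions made for illustration.

```python
import time


def with_retries(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() and retry on connection errors with exponential backoff.

    Hypothetical helper: the name, the defaults, and the exception types
    caught here are illustrative assumptions, not part of any WMCS tooling.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            # Give up after the final attempt and let the caller see the error.
            if attempt == attempts - 1:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** attempt)
```

Wrapping a database query or HTTP request in a function and passing it to `with_retries` would let it survive a restart lasting a few seconds. A prolonged outage would still surface as an error once the attempts run out.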
One switch will be restarted at 15:00 UTC on each of July 20th, 22nd,
27th, and 29th. The restart on the 27th is the most likely to affect
cloud services.
To avoid worst-case scenarios, the WMCS team will be failing over several
services before the restarts. Most of these changes won't be noticeable
to users, but we'll send notice in advance if anything dramatic is
expected.
-Andrew
Hi there,
On Thursday, July 22nd at 15:00 UTC (08:00 PDT / 11:00 EDT / 17:00 CEST) there is a
planned network maintenance that will affect the availability of the wiki
replica database service.
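
If your time zone is not among those listed above, the start time can be converted with a few lines of Python. This is a small sketch rather than an official tool: the helper name `local_times` is made up for illustration, the year 2021 is an assumption (the message itself gives only the day and month), and the zone names are merely examples matching the abbreviations above.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9


def local_times(utc_dt, zones):
    """Return the given UTC datetime formatted in each named time zone."""
    return {
        name: utc_dt.astimezone(ZoneInfo(name)).strftime("%H:%M %Z")
        for name in zones
    }


# Start of the maintenance window; the year 2021 is assumed.
window_start = datetime(2021, 7, 22, 15, 0, tzinfo=timezone.utc)

converted = local_times(
    window_start,
    ["America/Los_Angeles", "America/New_York", "Europe/Madrid"],
)
for zone, text in converted.items():
    print(f"{zone}: {text}")
```

Replace the zone names with any IANA time zone identifier to get the local start time for your own location.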
The operation window is expected to last about 5 minutes, and it will affect all
wiki replica users, including Toolforge tools, PAWS, and any other Cloud VPS
project using them.
More information can be found on Phabricator:
https://phabricator.wikimedia.org/T286614
regards.
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation
Network maintenance happening on Tuesday, July 20th at around 15:00 UTC will affect both nodes of the maps and scratch cluster (see https://phabricator.wikimedia.org/T286069). The interruption should be extremely short (measured in seconds, not minutes), so we will not be failing them over.
WMCS will keep an eye on the impact to client VMs and will remediate problems where necessary. If all goes well, most services won’t notice.
Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org
The NFS servers used for scratch and maps mounts (/data/project and /home in the maps project and /data/scratch in other projects) will be going offline for a short time tomorrow, 2021-07-01, at around 16:00 UTC to move the mounts to DRBD-synced volumes. The current setup causes odd issues during failover, including data loss and stale files left behind. The migration itself is one of those failovers, so some previously deleted files may reappear and need deleting again, and similar anomalies are possible.
I plan to reboot the maps project servers to make sure their mounts and processes are restored as fully as possible. The impact on the scratch mounts should be smaller. If you use scratch, just be aware that it will go offline for a bit and will come back with some possible quirks. After that, the data should become far more stable and properly synced between the two systems. The process could start later than 16:00 UTC if there are sync issues initially, as I will try to get as much of the data as possible transferred beforehand.
More details here: https://phabricator.wikimedia.org/T224747
Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org