Hello, all!
We are in the process of re-engineering and virtualizing[0] the NFS
service provided to Toolforge and VMs. The transition will be rocky and
involve some service interruption... I'm still running tests to
determine exactly how much disruption will be required.
The first volume that I'd like to replace is 'scratch,' typically
mounted as /mnt/nfs/secondary-scratch. I'm seeking feedback about how
vital scratch uptime is to your current workflow, and how disruptive it
would be to lose data there.
If you have a project or tool that uses scratch, please respond with
your thoughts! My preference would be to wipe out all existing data on
scratch and also subject users to several unannounced periods of
downtime, but I also don't want anyone to suffer. If you have
important/persistent data on that volume then the WMCS team will work
with you to migrate that data somewhere safer, and if you have an
important workflow that will break due to Scratch downtime then I'll
work harder on devising a low-impact roll-out.
Thank you!
-Andrew
[0] https://phabricator.wikimedia.org/T291405
Hello cloud-vps users,
There are still about 84 unclaimed projects at
https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge
Please take a moment to look at that page and mark projects that you are
using.
Unclaimed projects will be in danger of shutdown on February 1st, 2022.
Thank you to those of you who have already acted on this.
Thank you!
- Komla
-------- Forwarded Message --------
Subject: Cloud VPS users, please claim your projects (and, introducing
Komla)
Date: Thu, 2 Dec 2021 14:42:08 -0600
From: Andrew Bogott <abogott(a)wikimedia.org>
Reply-To: abogott(a)wikimedia.org
Organization: The Wikimedia Foundation
To: Cloud-announce(a)lists.wikimedia.org
CC: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Hello cloud-vps users!
It's time for our annual cleanup of unused projects and resources. Our new
developer advocate Komla Sapaty will be guiding this process; please
respond promptly to his emails and do your best to make him feel welcome!
Every year or so the Cloud Services team tries to identify and clean up
unused projects and VMs. We do this via an opt-in process: anyone can mark
a project as 'in use,' and that project will be preserved for another year.
I've created a wiki page that lists all existing projects, here:
https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge
If you are a VPS user, please visit that page and mark any projects that
you use as {{Used}}. Note that it's not necessary for you to be a project
admin to mark something -- if you know that you're currently using a
resource and want to keep using it, go ahead and mark it accordingly. If
you /are/ a project admin, please take a moment to mark which VMs are or
aren't used in your projects.
When February arrives, I will shut down unused projects and begin the
process of reclaiming their resources.
If you think you use a VPS project but aren't sure which, I encourage you
to poke around on https://tools.wmflabs.org/openstack-browser/ to see what
looks familiar. Worst case, just email cloud(a)lists.wikimedia.org with a
description of your use case and we'll sort it out there.
If you exclusively use Toolforge, you are free to ignore this email and
future related messages.
Thank you!
-Andrew and the WMCS team
Hi,
Today 2021-11-02 we had a severe network outage on Cloud VPS and Toolforge.
Several network connections were affected from 11:40 UTC to 13:20 UTC (1h40m
duration). As of this writing the problem has been corrected.
Detailed information can be seen in Phabricator:
https://phabricator.wikimedia.org/T294853
Sorry for the inconvenience.
Regards,
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation
An upgrade to JupyterHub is going out on 2021-10-26. There will be a new
singleuser container as a result. Currently running containers may need
restarting.
Thank you,
Michael DiPietro
Cloud Services SRE
If you were running a Toolforge web tool in Kubernetes before the toollabs-webservice label changes were deployed on 2021-09-29 (https://sal.toolforge.org/tools?d=2021-09-29), you may need to run `webservice stop && webservice start` to ensure your replica sets have the correct label expectations on them going forward. Otherwise you may see confusing behavior when running `webservice restart` and similar commands.
When I backfilled the new labels, I missed that you cannot change the label matching rules in a deployment retroactively. I apologize for any inconvenience.
In summary: If you haven’t run a webservice stop since 2021-09-29 on your Kubernetes web service, it would be a good idea to stop and start your webservice now to prevent any confusing behavior from webservice in the future.
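A sketch of the full stop/start cycle, run from your own account on a Toolforge bastion ('mytool' is a placeholder tool name):

    become mytool       # switch to the tool account
    webservice stop     # deletes the deployment and its stale replica set
    webservice start    # recreates them with the current label selectors

A plain `webservice restart` is not a substitute here, since it can run into the mismatched label rules described above.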
--
Brooke Storm
Staff SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org
Next Tuesday we will be upgrading Kubernetes on Toolforge. As part of the
upgrade we will need to restart all pods. This will produce a brief
interruption in web services and other tools that use Kubernetes. Assuming
your services are able to survive a restart, no action should be needed on
your part.
Thank you,
Michael + the WMCS team
TL;DR:
* Let's Encrypt [0] TLS certificates are "signed" by "root"
certificates to create a chain of trust
* The oldest "root" signing certificate for LE certs (DST Root CA X3)
expired on 2021-09-30 [1]
* Deprecated Toolforge Kubernetes containers only knew this root
certificate and not the newer root certificate (ISRG Root X1)
* Update your tool to a newer container to fix the problem
We are starting to hear reports of tools that suddenly stopped working
on 2021-09-30. The common issue is accessing the APIs for Wikimedia
wikis.
The Wikimedia wikis use multiple TLS certificates issued by different
providers for redundancy and protection against a problem with a
single certificate provider. One of the certificate providers that we
use is Let's Encrypt (LE) [0]. LE certificates are themselves signed
by multiple "root" certificates to create a chain of trust that your
web browser or other TLS verifying software can trust. The oldest root
certificate (named "DST Root CA X3") used to sign the LE certificates
expired on 2021-09-30 [1]. Very old operating systems and some
compiled software do not have the newer root certificate (named "ISRG
Root X1") in their trusted certificate collection. These systems are
now rejecting LE certificates.
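A quick way to check whether a given system knows the newer root is to look in its trusted certificate collection. This assumes a Debian-style layout where each trusted root is a separate file under /etc/ssl/certs; adjust the path for other systems:

    ls /etc/ssl/certs | grep -i 'ISRG'

If nothing is printed, the system lacks ISRG Root X1 and will reject current LE certificates.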
In Toolforge, we think that this mainly affects tools running on the
Kubernetes cluster inside Debian Jessie based containers. Specifically
the "php5.6", "python", "python2", and "ruby2" containers are expected
to have issues with the LE certificate expiration based on what we
have found so far. Recommended replacement containers are "php7.4",
"python3.9", and "ruby25".
We also have reports of `mono` on the bastions + grid engine failing.
We do not yet have a fix for this. It will require us to compile and
install a newer version of mono for everyone who is using it.
Interested folks can follow progress of our infrastructure updates in
response to this issue at T291387 [2].
[0]: https://letsencrypt.org/
[1]: https://letsencrypt.org/docs/dst-root-ca-x3-expiration-september-2021/
[2]: https://phabricator.wikimedia.org/T291387
Bryan
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
Due to MediaWiki schema changes from:
https://phabricator.wikimedia.org/T291719
the abuse_filter_log.afl_filter column will be dropped from the wiki
replicas views on 2021/10/06. We apologize for any inconvenience. Please
update queries accordingly.
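As an illustration, a query that selected the dropped column will need to move to its replacement columns. Based on the schema change task those appear to be afl_filter_id and afl_global, but please confirm against the live replica schema before updating your tools. Using the `sql` wrapper on a Toolforge bastion:

    # Before (breaks once the view column is dropped on 2021/10/06):
    sql enwiki 'SELECT afl_filter FROM abuse_filter_log LIMIT 1'
    # After (column names taken from T291719; verify before relying on them):
    sql enwiki 'SELECT afl_filter_id, afl_global FROM abuse_filter_log LIMIT 1'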
We will be upgrading PAWS Kubernetes on 2021/09/07 at 15:00 UTC. User impacts
should be minimal, but you might see your notebook server stop and restart
at some point during the change.
Michael DiPietro
SRE
Wikimedia Cloud Services