Every year or so the Cloud Services team tries to identify and clean up
unused projects and VMs. We do this via an opt-in process: anyone can
mark a project as 'in use,' and that project will be preserved for
another year.
I've created a wiki page that lists all existing projects, here:
https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2019_Purge
If you are a VPS user, please visit that page and mark any projects that
you use as {{Used}}. Note that it's not necessary for you to be a
project admin to mark something -- if you know that you're currently
using a resource and want to keep using it, go ahead and mark it
accordingly. If you /are/ a project admin, please take a moment to mark
which VMs are or aren't used in your projects.
When December arrives, I will shut down unused projects and begin
reclaiming their resources.
If you think you use a VPS project but aren't sure which, I encourage
you to poke around on https://tools.wmflabs.org/openstack-browser/ to
see what looks familiar. Worst case, just email
cloud(a)lists.wikimedia.org with a description of your use case and we'll
sort it out there.
Users who exclusively use Toolforge are free to ignore this task.
Thank you!
-Andrew and WMCS team
Hi,
Today, 2019-09-30, we performed an operation on all Cloud VPS virtual
machines to update ferm, a firewalling utility, in order to fix a bug [0].
The fleet-wide operation installed ferm on every VM, even those that did
not require it. This caused a network outage for most of the virtual
machines and projects that were not previously configured to use ferm.
Many Toolforge tools (webservices, grid jobs, etc.) stopped working,
database connections were lost, the proxy reported bad gateway errors,
and so on.
To resolve the issue, we quickly removed ferm from every VM and ran the
puppet agent to reinstall it only on the VMs that declare ferm in their
puppet manifests.
As soon as we did this, everything went back to normal.
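The remediation logic can be sketched as a toy model (the hostnames and package sets below are made up for illustration, not taken from the actual manifests):

```python
# Toy model of the fix: after the blanket install was rolled back,
# ferm was reinstalled only on VMs whose puppet manifests declare it.
# These example hostnames and package sets are purely illustrative.
manifests = {
    "tools-worker-01": {"nginx"},          # no ferm declared
    "cloudgw-01": {"ferm", "keepalived"},  # ferm declared
}

def should_have_ferm(vm: str) -> bool:
    """A VM should run ferm only if its manifest declares it."""
    return "ferm" in manifests.get(vm, set())

assert not should_have_ferm("tools-worker-01")
assert should_have_ferm("cloudgw-01")
```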
This incident lasted roughly one hour.
Please get in touch if you see any issues or have any questions about
this incident.
Regards.
[0] https://phabricator.wikimedia.org/T153468
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation
Due to a mishap during routine data-center maintenance, one of our
hypervisors lost power just now. Everything is back up and running, but
some of you may have experienced a few minutes of downtime and an
unexpected reboot of your instance.
Toolforge was largely unaffected by this incident, other than some jobs
getting rescheduled. The VMs that were restarted are:
accounts-dbslave.account-creation-assistance.eqiad.wmflabs
af-netbox01.automation-framework.eqiad.wmflabs
arturo-k8s-test-2.openstack.eqiad.wmflabs
arturo-k8s-test-3.openstack.eqiad.wmflabs
arturo-k8s-test-4-2.openstack.eqiad.wmflabs
beryllium.rcm.eqiad.wmflabs
canary1027-01.testlabs.eqiad.wmflabs
captcha-imageprocessing-11.privpol-captcha.eqiad.wmflabs
clouddb-services-puppetmaster-01.clouddb-services.eqiad.wmflabs
deployment-acme-chief04.deployment-prep.eqiad.wmflabs
deployment-aqs01.deployment-prep.eqiad.wmflabs
deployment-aqs02.deployment-prep.eqiad.wmflabs
deployment-db06.deployment-prep.eqiad.wmflabs
deployment-prometheus02.deployment-prep.eqiad.wmflabs
gnd-02.orig.eqiad.wmflabs
jbond-buster.puppet.eqiad.wmflabs
krenair-t219424-b.testlabs.eqiad.wmflabs
lizenzhinweisgenerator-api-test.lizenzhinweisgenerator.eqiad.wmflabs
logstack03.security-tools.eqiad.wmflabs
mcr-sdc.mcr-dev.eqiad.wmflabs
ntp-02.cloudinfra.eqiad.wmflabs
paws-int-lb-02.paws.eqiad.wmflabs
paws-master-02.paws.eqiad.wmflabs
paws-packages-01.paws.eqiad.wmflabs
paws-proxy-02.paws.eqiad.wmflabs
paws-puppetmaster-01.paws.eqiad.wmflabs
paws-worker-01.paws.eqiad.wmflabs
proxy-01.project-proxy.eqiad.wmflabs
redirects-nginx01.redirects.eqiad.wmflabs
sentry-builder.sentry.eqiad.wmflabs
toolsbeta-docker-registry-01.toolsbeta.eqiad.wmflabs
wikibase-stretch.wikidata-dev.eqiad.wmflabs
wpx-mediawiki-02.wpx.eqiad.wmflabs
On June 30, 2020 the Debian project will stop providing security patch
support for the Debian 8 "Jessie" release. The Cloud Services and SRE
teams at the Wikimedia Foundation would like to have all usage of
Debian Jessie in our managed networks replaced with newer versions of
Debian's operating system on or ideally well before that date.
A page has been created on Wikitech [0] with an initial timeline for
the removal of all Debian Jessie instances from Cloud VPS projects.
This timeline follows roughly the same schedule as we used in 2018
when deprecating Ubuntu Trusty in Cloud VPS projects:
* September 2019: Announce the initiative via this email and the Wikitech page
* October 2019: Start actively contacting instance maintainers who
need to migrate to a new OS
* November & December 2019: Continue to work with instance maintainers
to migrate to a new OS
* January 2020: Shut down remaining Debian Jessie instances
If you know that your Cloud VPS project is using Debian Jessie, you
can get a head start on migrating your instances to Debian Buster
(preferred) or Stretch by visiting the Wikitech page and reading the
instructions there.
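If you're unsure which release an instance is running, /etc/os-release will tell you; here is a minimal sketch of that check (the sample file contents below are illustrative):

```python
def debian_major(os_release_text: str):
    """Extract the major Debian version from /etc/os-release contents."""
    for line in os_release_text.splitlines():
        if line.startswith("VERSION_ID="):
            return int(line.split("=", 1)[1].strip('"').split(".")[0])
    return None

# Sample contents as they would appear on a Jessie instance.
sample = 'PRETTY_NAME="Debian GNU/Linux 8 (jessie)"\nVERSION_ID="8"\n'
assert debian_major(sample) == 8  # Debian 8 is Jessie: migration needed
```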
If you are a concerned Toolforge user, stay tuned for future
announcements about changes that will be made as the Toolforge admin
team works to remove Debian Jessie from that environment. For now
there is nothing an individual Tool maintainer needs to do.
[0]: https://wikitech.wikimedia.org/wiki/News/Jessie_deprecation
Bryan - on behalf of the Cloud VPS admin team
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
We need to do a proper incident report, but I wanted to send out a
(late) notice that the Toolforge Kubernetes cluster was at best
degraded and at worst completely broken from 2019-09-10T18:54 to
2019-09-11T01:30.
The TL;DR is that some change, likely part of T171188 ("Move the main
WMCS puppetmaster into the Labs realm"), tricked Puppet into installing
an old version of the x509 signing cert used to secure communication
between the etcd cluster and kube-apiserver. This surfaced as an alert
from our monitoring system that the Kubernetes API was broken.
When investigating that alert we found that the kube-apiserver was
unable to connect to its paired etcd cluster. The etcd cluster seemed
to be flapping internally (status showing good, then failed, then good
again). Diagnosing the cause of this flapping resulted in a complete
failure of the etcd cluster. Restoring the etcd cluster was a long and
difficult task. Once etcd was recovered, it took about 1.5 more hours
to find the cause and fix for the initial communication errors (the
wrong x509 signing certificate). It is currently unclear if the x509
misconfiguration also caused the etcd cluster failure, or if that was
an unrelated and unfortunate coincidence.
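One way to catch this class of problem early is to compare certificate fingerprints on both sides of the connection; a minimal sketch (the byte strings below stand in for real DER-encoded certificates):

```python
import hashlib

def fingerprint(der_bytes: bytes) -> str:
    """SHA-256 fingerprint of a certificate's DER bytes."""
    return hashlib.sha256(der_bytes).hexdigest()

# Placeholder bytes standing in for the cert etcd presents and the one
# kube-apiserver trusts; in an incident like this, they do not match.
cert_on_etcd = b"stale signing cert"
cert_expected = b"current signing cert"
assert fingerprint(cert_on_etcd) != fingerprint(cert_expected)
```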
See https://phabricator.wikimedia.org/T232536 for follow up
documentation (when we write it during the coming US business day).
Bryan - on behalf of the Toolforge admin team
Later today (starting in a few hours around 18:00 UTC) we'll be
rearranging the puppetmaster setup for most cloud VMs[0]. No tools or
services (other than puppet) should be affected, but some of you might
get grumpy emails about broken puppet runs during the transition, which
I encourage you to ignore. If you're planning to update the puppet
configuration of your VMs, I encourage you to postpone that work until
after our migration.
[0] full context at https://phabricator.wikimedia.org/T171188
The DNS recursor servers used from inside Cloud VPS and Toolforge to
resolve both internal and external hostnames to IP addresses were not
functional from approximately 2019-09-09T00:51 UTC to 2019-09-09T01:35
UTC. During this time, most (if not all) DNS lookups
would have returned a "SERVFAIL" response. The issue appears to be
resolved now.
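For context, SERVFAIL is DNS response code 2 (RFC 1035): it means the resolver itself failed, as opposed to the queried name not existing. A small reference sketch:

```python
# DNS response codes from RFC 1035, section 4.1.1.
RCODES = {
    0: "NOERROR",   # query succeeded
    2: "SERVFAIL",  # resolver failure; what clients saw during the outage
    3: "NXDOMAIN",  # the name does not exist (not the failure mode here)
}

def is_resolver_failure(rcode: int) -> bool:
    """SERVFAIL points at the recursor, not at the queried name."""
    return RCODES.get(rcode) == "SERVFAIL"

assert is_resolver_failure(2)
assert not is_resolver_failure(3)
```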
We will share more information about what happened and how the problem
was corrected when we are sure that doing so will not cause additional
issues.
Bryan, on behalf of the Cloud VPS admin team
(Corrected the date in the subject line from the previous notification.)
Next Tuesday, September 3rd, between 13:00 and 14:00 UTC, we'll be
performing backend database maintenance on the OpenStack VPS control plane.
During this maintenance window the Horizon web dashboard will be
unavailable, and all requests to create, modify, or delete VPS resources
like virtual machines and DNS entries will be blocked.
Existing VPS virtual machines will remain running, and Toolforge users
will not be affected by this maintenance.
---
Wikimedia Cloud Services