Every year or so the Cloud Services team tries to identify and clean up
unused projects and VMs. We do this via an opt-in process: anyone can
mark a project as 'in use,' and that project will be preserved for
another year.
I've created a wiki page that lists all existing projects, here:
https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2019_Purge
If you are a VPS user, please visit that page and mark any projects that
you use as {{Used}}. Note that it's not necessary for you to be a
project admin to mark something -- if you know that you're currently
using a resource and want to keep using it, go ahead and mark it
accordingly. If you /are/ a project admin, please take a moment to mark
which VMs are or aren't used in your projects.
When December arrives, I will shut down unused projects and begin
reclaiming their resources.
If you think you use a VPS project but aren't sure which, I encourage
you to poke around on https://tools.wmflabs.org/openstack-browser/ to
see what looks familiar. Worst case, just email
cloud(a)lists.wikimedia.org with a description of your use case and we'll
sort it out there.
Users who exclusively use Toolforge are free to ignore this task.
Thank you!
-Andrew and WMCS team
Hi,
Today, 2019-09-30, we performed an operation on all Cloud VPS virtual
machines to update ferm, a firewalling utility, in order to fix a bug [0].
The fleet-wide operation installed ferm on every VM, even those that did
not require it. This caused a network outage for most of the virtual
machines and projects that were not previously configured to use ferm.
Many Toolforge tools (webservices, grid jobs, etc.) stopped working,
database connections were lost, the proxy reported bad gateway errors,
and so on.
To resolve the issue, we quickly removed ferm from every VM and ran the
puppet agent to reinstall it only on the VMs that declare ferm in their
puppet manifests.
As soon as we did this, everything went back to normal.
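The remediation logic can be sketched as a toy model (the hostnames and package sets below are made up for illustration, not taken from the actual manifests):

```python
# Toy model of the fix: after the blanket install was rolled back,
# ferm was reinstalled only on VMs whose puppet manifests declare it.
# These example hostnames and package sets are purely illustrative.
manifests = {
    "tools-worker-01": {"nginx"},          # no ferm declared
    "cloudgw-01": {"ferm", "keepalived"},  # ferm declared
}

def should_have_ferm(vm: str) -> bool:
    """A VM should run ferm only if its manifest declares it."""
    return "ferm" in manifests.get(vm, set())

assert not should_have_ferm("tools-worker-01")
assert should_have_ferm("cloudgw-01")
```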
This incident lasted roughly one hour.
Please get in touch if you see any issues or have any questions about
this incident.
Regards.
[0] https://phabricator.wikimedia.org/T153468
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation
Due to a mishap during routine data-center maintenance, one of our
hypervisors lost power just now. Everything is back up and running, but
some of you may have experienced a few minutes of downtime and an
unexpected reboot of your instance.
Toolforge was largely unaffected by this incident, other than some jobs
getting rescheduled. The VMs that were restarted are:
accounts-dbslave.account-creation-assistance.eqiad.wmflabs
af-netbox01.automation-framework.eqiad.wmflabs
arturo-k8s-test-2.openstack.eqiad.wmflabs
arturo-k8s-test-3.openstack.eqiad.wmflabs
arturo-k8s-test-4-2.openstack.eqiad.wmflabs
beryllium.rcm.eqiad.wmflabs
canary1027-01.testlabs.eqiad.wmflabs
captcha-imageprocessing-11.privpol-captcha.eqiad.wmflabs
clouddb-services-puppetmaster-01.clouddb-services.eqiad.wmflabs
deployment-acme-chief04.deployment-prep.eqiad.wmflabs
deployment-aqs01.deployment-prep.eqiad.wmflabs
deployment-aqs02.deployment-prep.eqiad.wmflabs
deployment-db06.deployment-prep.eqiad.wmflabs
deployment-prometheus02.deployment-prep.eqiad.wmflabs
gnd-02.orig.eqiad.wmflabs
jbond-buster.puppet.eqiad.wmflabs
krenair-t219424-b.testlabs.eqiad.wmflabs
lizenzhinweisgenerator-api-test.lizenzhinweisgenerator.eqiad.wmflabs
logstack03.security-tools.eqiad.wmflabs
mcr-sdc.mcr-dev.eqiad.wmflabs
ntp-02.cloudinfra.eqiad.wmflabs
paws-int-lb-02.paws.eqiad.wmflabs
paws-master-02.paws.eqiad.wmflabs
paws-packages-01.paws.eqiad.wmflabs
paws-proxy-02.paws.eqiad.wmflabs
paws-puppetmaster-01.paws.eqiad.wmflabs
paws-worker-01.paws.eqiad.wmflabs
proxy-01.project-proxy.eqiad.wmflabs
redirects-nginx01.redirects.eqiad.wmflabs
sentry-builder.sentry.eqiad.wmflabs
toolsbeta-docker-registry-01.toolsbeta.eqiad.wmflabs
wikibase-stretch.wikidata-dev.eqiad.wmflabs
wpx-mediawiki-02.wpx.eqiad.wmflabs
On June 30, 2020 the Debian project will stop providing security patch
support for the Debian 8 "Jessie" release. The Cloud Services and SRE
teams at the Wikimedia Foundation would like to have all usage of
Debian Jessie in our managed networks replaced with newer versions of
Debian's operating system on or ideally well before that date.
A page has been created on Wikitech [0] with an initial timeline for
the removal of all Debian Jessie instances from Cloud VPS projects.
This timeline follows roughly the same schedule as we used in 2018
when deprecating Ubuntu Trusty in Cloud VPS projects:
* September 2019: Announce the initiative via this email and the Wikitech page
* October 2019: Start actively contacting instance maintainers who
need to migrate to a new OS
* November & December 2019: Continue to work with instance maintainers
to migrate to a new OS
* January 2020: Shut down remaining Debian Jessie instances
If you know that your Cloud VPS project is using Debian Jessie, you
can get a head start on migrating your instances to Debian Buster
(preferred) or Stretch by visiting the Wikitech page and reading the
instructions there.
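If you're unsure which release an instance is running, /etc/os-release will tell you; here is a minimal sketch of that check (the sample file contents below are illustrative):

```python
def debian_major(os_release_text: str):
    """Extract the major Debian version from /etc/os-release contents."""
    for line in os_release_text.splitlines():
        if line.startswith("VERSION_ID="):
            return int(line.split("=", 1)[1].strip('"').split(".")[0])
    return None

# Sample contents as they would appear on a Jessie instance.
sample = 'PRETTY_NAME="Debian GNU/Linux 8 (jessie)"\nVERSION_ID="8"\n'
assert debian_major(sample) == 8  # Debian 8 is Jessie: migration needed
```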
If you are a concerned Toolforge user, stay tuned for future
announcements about changes that will be made as the Toolforge admin
team works to remove Debian Jessie from that environment. For now
there is nothing an individual Tool maintainer needs to do.
[0]: https://wikitech.wikimedia.org/wiki/News/Jessie_deprecation
Bryan - on behalf of the Cloud VPS admin team
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
We need to do a proper incident report, but I wanted to send out a
(late) notice that the Toolforge Kubernetes cluster was at best
degraded and at worst completely broken from 2019-09-10T18:54 to
2019-09-11T01:30.
The TL;DR is that some change, likely part of T171188 ("Move the main
WMCS puppetmaster into the Labs realm"), tricked Puppet into installing
an old version of the x509 signing cert used to secure communication
between the etcd cluster and kube-apiserver. This surfaced as an alert
from our monitoring system that the Kubernetes API was broken.
When investigating that alert we found that the kube-apiserver was
unable to connect to its paired etcd cluster. The etcd cluster seemed
to be flapping internally (status showing good, then failed, then good
again). Diagnosing the cause of this flapping resulted in a complete
failure of the etcd cluster. Restoring the etcd cluster was a long and
difficult task. Once etcd was recovered, it took about 1.5 more hours
to find the cause and fix for the initial communication errors (the
wrong x509 signing certificate). It is currently unclear if the x509
misconfiguration also caused the etcd cluster failure, or if that was
an unrelated and unfortunate coincidence.
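One way to catch this class of problem early is to compare certificate fingerprints on both sides of the connection; a minimal sketch (the byte strings below stand in for real DER-encoded certificates):

```python
import hashlib

def fingerprint(der_bytes: bytes) -> str:
    """SHA-256 fingerprint of a certificate's DER bytes."""
    return hashlib.sha256(der_bytes).hexdigest()

# Placeholder bytes standing in for the cert etcd presents and the one
# kube-apiserver trusts; in an incident like this, they do not match.
cert_on_etcd = b"stale signing cert"
cert_expected = b"current signing cert"
assert fingerprint(cert_on_etcd) != fingerprint(cert_expected)
```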
See https://phabricator.wikimedia.org/T232536 for follow up
documentation (when we write it during the coming US business day).
Bryan - on behalf of the Toolforge admin team
Later today (starting in a few hours around 18:00 UTC) we'll be
rearranging the puppetmaster setup for most cloud VMs[0]. No tools or
services (other than puppet) should be affected, but some of you might
get grumpy emails about broken puppet runs during the transition, which
I encourage you to ignore. If you're planning to update the puppet
configuration of your VMs, I encourage you to postpone that work until
after our migration.
[0] full context at https://phabricator.wikimedia.org/T171188
The DNS recursor servers used from inside Cloud VPS and Toolforge to
resolve both internal and external hostnames to IP addresses were not
functional from approximately 2019-09-09T00:51 UTC to 2019-09-09T01:35
UTC. During this time, most (if not all) DNS lookups
would have returned a "SERVFAIL" response. The issue appears to be
resolved now.
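For context, SERVFAIL is DNS response code 2 (RFC 1035): it means the resolver itself failed, as opposed to the queried name not existing. A small reference sketch:

```python
# DNS response codes from RFC 1035, section 4.1.1.
RCODES = {
    0: "NOERROR",   # query succeeded
    2: "SERVFAIL",  # resolver failure; what clients saw during the outage
    3: "NXDOMAIN",  # the name does not exist (not the failure mode here)
}

def is_resolver_failure(rcode: int) -> bool:
    """SERVFAIL points at the recursor, not at the queried name."""
    return RCODES.get(rcode) == "SERVFAIL"

assert is_resolver_failure(2)
assert not is_resolver_failure(3)
```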
We will share more information about what happened and how the problem
was corrected when we are sure that doing so will not cause additional
issues.
Bryan, on behalf of the Cloud VPS admin team
(Corrected the date in the subject line from the previous notification.)
Next Tuesday, September 3rd, between 13:00 and 14:00 UTC, we'll be
performing backend database maintenance on the OpenStack VPS control plane.
During this maintenance window the Horizon web dashboard will be
unavailable, and all requests to create, modify, or delete VPS resources
like virtual machines and DNS entries will be blocked.
Existing VPS virtual machines will remain running, and Toolforge users
will not be affected by this maintenance.
---
Wikimedia Cloud Services