Hi,
today, 2019-09-30, we performed an operation on all Cloud VPS virtual machines
to update ferm, a firewalling utility, to fix a bug [0].
The fleet-wide operation installed ferm on every VM, including those that did
not require it. This caused a network outage for most of the virtual machines
and projects that had not previously been configured to use ferm: many
Toolforge tools (webservices, grid jobs, etc.) stopped working, database
connections were lost, the web proxy reported bad gateway errors, and so on.
To resolve the issue, we quickly removed ferm from every VM and ran the puppet
agent to reinstall it only on the VMs that have ferm in their puppet manifests.
As soon as we did this, everything went back to normal.
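For reference, the per-VM remediation was roughly equivalent to the following
(a sketch; the actual rollout used our fleet-management tooling):

    # Remove the stray ferm package, then let puppet reinstall it
    # only where the VM's manifests actually require it
    sudo apt-get remove --purge -y ferm
    sudo puppet agent --test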
The incident lasted roughly one hour.
Please get in touch if you notice any lingering issues or have any questions
about this incident.
Regards.
[0] https://phabricator.wikimedia.org/T153468
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation
Due to a mishap during routine data-center maintenance, one of our
hypervisors lost power a short while ago. Everything is back up and running
now, but some of you may have experienced a few minutes of downtime and an
unexpected reboot of your instance.
Toolforge was largely unaffected by this incident, other than some jobs
getting rescheduled. The VMs that were restarted are:
accounts-dbslave.account-creation-assistance.eqiad.wmflabs
af-netbox01.automation-framework.eqiad.wmflabs
arturo-k8s-test-2.openstack.eqiad.wmflabs
arturo-k8s-test-3.openstack.eqiad.wmflabs
arturo-k8s-test-4-2.openstack.eqiad.wmflabs
beryllium.rcm.eqiad.wmflabs
canary1027-01.testlabs.eqiad.wmflabs
captcha-imageprocessing-11.privpol-captcha.eqiad.wmflabs
clouddb-services-puppetmaster-01.clouddb-services.eqiad.wmflabs
deployment-acme-chief04.deployment-prep.eqiad.wmflabs
deployment-aqs01.deployment-prep.eqiad.wmflabs
deployment-aqs02.deployment-prep.eqiad.wmflabs
deployment-db06.deployment-prep.eqiad.wmflabs
deployment-prometheus02.deployment-prep.eqiad.wmflabs
gnd-02.orig.eqiad.wmflabs
jbond-buster.puppet.eqiad.wmflabs
krenair-t219424-b.testlabs.eqiad.wmflabs
lizenzhinweisgenerator-api-test.lizenzhinweisgenerator.eqiad.wmflabs
logstack03.security-tools.eqiad.wmflabs
mcr-sdc.mcr-dev.eqiad.wmflabs
ntp-02.cloudinfra.eqiad.wmflabs
paws-int-lb-02.paws.eqiad.wmflabs
paws-master-02.paws.eqiad.wmflabs
paws-packages-01.paws.eqiad.wmflabs
paws-proxy-02.paws.eqiad.wmflabs
paws-puppetmaster-01.paws.eqiad.wmflabs
paws-worker-01.paws.eqiad.wmflabs
proxy-01.project-proxy.eqiad.wmflabs
redirects-nginx01.redirects.eqiad.wmflabs
sentry-builder.sentry.eqiad.wmflabs
toolsbeta-docker-registry-01.toolsbeta.eqiad.wmflabs
wikibase-stretch.wikidata-dev.eqiad.wmflabs
wpx-mediawiki-02.wpx.eqiad.wmflabs
On June 30, 2020, the Debian project will stop providing security patch
support for the Debian 8 "Jessie" release. The Cloud Services and SRE
teams at the Wikimedia Foundation would like to see all use of Debian
Jessie in our managed networks replaced with newer versions of the Debian
operating system on, or ideally well before, that date.
A page has been created on Wikitech [0] with an initial timeline for
the removal of all Debian Jessie instances from Cloud VPS projects.
This timeline follows roughly the same schedule as we used in 2018
when deprecating Ubuntu Trusty in Cloud VPS projects:
* September 2019: Announce the initiative via this email and the Wikitech page
* October 2019: Start actively contacting instance maintainers who
need to migrate to a new OS
* November & December 2019: Continue to work with instance maintainers
to migrate to a new OS
* January 2020: Shut down remaining Debian Jessie instances
If you know that your Cloud VPS project is using Debian Jessie, you
can get a head start on migrating your instances to Debian Buster
(preferred) or Stretch by visiting the Wikitech page and reading the
instructions there.
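If you are unsure which Debian release an instance is running, you can check
from a shell on the instance itself, for example:

    # Prints the release codename: "jessie", "stretch", or "buster"
    lsb_release -sc

    # Or inspect the release metadata directly
    cat /etc/os-release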
If you are a concerned Toolforge user, stay tuned for future
announcements about changes that will be made as the Toolforge admin
team works to remove Debian Jessie from that environment. For now
there is nothing an individual Tool maintainer needs to do.
[0]: https://wikitech.wikimedia.org/wiki/News/Jessie_deprecation
Bryan - on behalf of the Cloud VPS admin team
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
We need to do a proper incident report, but I wanted to send out a
(late) notice that the Toolforge Kubernetes cluster was at best
degraded and at worst completely broken from 2019-09-10T18:54 to
2019-09-11T01:30.
The TL;DR is that some change, likely part of T171188 ("Move the main
WMCS puppetmaster into the Labs realm"), tricked Puppet into installing
an old version of the x509 signing cert used to secure communication
between the etcd cluster and kube-apiserver. This manifested as an
alert from our monitoring system that the Kubernetes API was broken.
When investigating that alert, we found that the kube-apiserver was
unable to connect to its paired etcd cluster. The etcd cluster seemed
to be flapping internally (status showing good, then failed, then good
again). Diagnosing the cause of this flapping resulted in a complete
failure of the etcd cluster. Restoring the etcd cluster was a long and
difficult task. Once etcd was recovered, it took about 1.5 more hours
to find the cause and fix for the initial communication errors (the
wrong x509 signing certificate). It is currently unclear if the x509
misconfiguration also caused the etcd cluster failure, or if that was
an unrelated and unfortunate coincidence.
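For those curious, the debugging involved checks along these lines (hostnames
and file paths here are illustrative, and exact etcdctl flags vary by etcd
version):

    # Ask etcd about cluster membership and health (v2-era tooling)
    etcdctl --endpoints https://etcd-host.example:2379 cluster-health

    # Inspect the issuer, subject, and validity window of the client
    # certificate securing the kube-apiserver <-> etcd connection
    openssl x509 -in /path/to/etcd-client.crt -noout -issuer -subject -dates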
See https://phabricator.wikimedia.org/T232536 for follow up
documentation (when we write it during the coming US business day).
Bryan - on behalf of the Toolforge admin team
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
Later today (starting in a few hours, around 18:00 UTC) we'll be
rearranging the puppetmaster setup for most Cloud VPS VMs [0]. No tools or
services (other than puppet) should be affected, but some of you might
get grumpy emails about broken puppet runs during the transition; feel
free to ignore those. If you're planning to update the puppet
configuration of your VMs, please postpone that work until after our
migration.
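If you want to confirm that puppet is healthy on a VM once the migration is
done, a manual run should complete without errors:

    # Trigger a one-off puppet run and watch its output
    sudo puppet agent --test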
[0] full context at https://phabricator.wikimedia.org/T171188
The DNS recursor servers, which are used from inside Cloud VPS and
Toolforge to resolve both internal and external hostnames to IP
addresses, were not functional from approximately 2019-09-09T00:51 UTC to
2019-09-09T01:35 UTC. During this time, most (if not all) DNS lookups
would have returned a "SERVFAIL" response. The issue appears to be
resolved now.
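For illustration, a lookup from inside Cloud VPS during the outage would have
failed along these lines (output abbreviated):

    $ dig wikipedia.org
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: ...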
We will share more information about what happened and how the problem
was corrected when we are sure that doing so will not cause additional
issues.
Bryan, on behalf of the Cloud VPS admin team
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
(Corrected the date in the subject line from the previous notification.)
Next Tuesday, September 3rd, between 13:00 and 14:00 UTC, we'll be
performing backend database maintenance on the OpenStack control plane for
Cloud VPS. During this maintenance window the Horizon web dashboard will be
unavailable, and all requests to create, modify, or delete VPS resources
such as virtual machines and DNS entries will be blocked.
Existing VPS virtual machines will remain running and Toolforge users will
not be affected by this maintenance.
--
Wikimedia Cloud Services
Today I rebuilt the Docker images that are used by the `webservice
--backend=kubernetes` command. This is a normal thing that we do
periodically in Toolforge to ensure that security patches are applied in
the containers. This round of updates was a bit different, however, in
that it is the first time the Debian Jessie-based images have been
rebuilt since the upstream Debian project removed the 'jessie-backports'
apt repo.
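If you want a running webservice to pick up a rebuilt image right away,
restarting it should be enough (a sketch, assuming your tool uses the
Kubernetes backend):

    # Run from a Toolforge bastion as your tool account
    webservice --backend=kubernetes restart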
Everything should be fine, but if you see weirdness when restarting a
webservice or other Kubernetes pod that looks like it could be related
to software in the Docker image, please let me or one of the Toolforge
admins know, either by filing a Phabricator bug report or, for a faster
response, by joining the #wikimedia-cloud IRC channel on Freenode and
sending a "!help ...." message to the channel explaining your issue.
Bryan - on behalf of the Toolforge admins
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808