On Wed 26-Feb, we are making a large change to how NFS is mounted in
Cloud Services
(https://gerrit.wikimedia.org/r/c/operations/puppet/+/571821). This will
impact any Cloud VPS projects that mount NFS for home directories,
project directories and scratch, including Toolforge. During this
change, NFS will become unresponsive for a short time. Some NFS clients
will recover on their own with little impact. Where needed, WMCS will
reboot or remount NFS clients.
This change will improve future NFS management and hopefully reduce
future disruptions from maintenance.
Following a beta testing period [0] and a general use self-migration
period [1], the Toolforge administration team is ready to begin the
final phase of automatic migration of tools currently running on the
legacy Kubernetes cluster to the 2020 Kubernetes cluster.
The migration process will involve Toolforge administrators running
`webservice migrate` for each tool in the same way that self-migration
happens [2]. A small number of tools are using the legacy Kubernetes
cluster outside of the `webservice` system. These tools will be moved
using a more manual process after we move all webservices. We are
currently planning on doing these migrations in several batches so
that we can monitor the load and capacity of the 2020 Kubernetes
cluster as we move ~640 more tools over from the legacy cluster.
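The batching approach described above could be sketched roughly as follows. This is only an illustration: the tool names, the batch size, and the `make_batches` helper are hypothetical, and the actual tooling the Toolforge admins use is not shown here.

```python
from typing import Iterator, List


def make_batches(tools: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive slices of `tools`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(tools), batch_size):
        yield tools[start:start + batch_size]


if __name__ == "__main__":
    # Hypothetical tool names; the real list would come from the legacy cluster.
    tools = [f"tool-{n:03d}" for n in range(1, 11)]
    for number, batch in enumerate(make_batches(tools, batch_size=4), start=1):
        # An admin would run `webservice migrate` for each tool in the batch,
        # then check cluster load and capacity before starting the next one.
        print(f"batch {number}: {', '.join(batch)}")
```

Pausing between batches is what leaves room to monitor the 2020 cluster before more tools arrive.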
Once the tools have all been moved to the 2020 cluster, we will
continue with additional clean up and default configuration changes
which will allow us to fully decommission the legacy cluster. We will
also be updating various documentation on Wikitech during this final
phase. We hope to complete this entire process by 2020-03-06 at the
latest.
[0]: https://lists.wikimedia.org/pipermail/cloud-announce/2020-January/000247.ht…
[1]: https://lists.wikimedia.org/pipermail/cloud-announce/2020-January/000252.ht…
[2]: https://wikitech.wikimedia.org/wiki/News/2020_Kubernetes_cluster_migration#…
Bryan (on behalf of the Toolforge admins and the Cloud Services team)
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
The barrage of hardware failures continues! Next week the eqiad staff
will be repairing cloudvirt1014; to prepare I'll be draining it on this
Thursday (2020-02-20), beginning around 15:00 UTC. Affected instances
will be down for a few minutes and then rebooted. Toolforge users should
be largely unaffected by this maintenance.
Here is the list of affected VMs:
traffic-cache-atsupload-buster
canary1014-01
util-abogott-buster
filippo-log-buster01
staging
opusmt
mw01
grantreview-04
cloud-puppetmaster-03
commtech-wikiwho
toolsbeta-test-k8s-haproxy-2
debmonitor-pm
toolsbeta-test-k8s-control-1
toolsbeta-test-k8s-etcd-2
toolsbeta-test-k8s-etcd-1
jmm-debm-01
xtools-dev05
ores-web-05
roebling
ores-web-04
cloudinfra-db02
Krypton
discovery-production-02
maps-tiles1
wikitextexp-base-1002
accounts-appserver4
tofawiki02
packagist-mirror1
deployment-elastic06
deployment-changeprop
deployment-restbase02
deployment-imagescaler01
deployment-kafka-jumbo-1
deployment-memc07
deployment-eventlog05
deployment-cpjobqueue
deployment-mediawiki-07
deployment-chromium01
deployment-cache-text05
whgi
wikilabels
gitservices
wikilabels-02
af-puppetdb02
missing-sections
ores-lb-03
matrix-synapse-01
captcha-tf-43
k4-2
Tomorrow I need to relocate 'tools-sgecron-01', the VM that is in charge
of starting cron jobs on the grid. The host will be down for 5-10
minutes, during which time no cron jobs will start.
I'm going to make the move around 16:00 UTC, although I'll wait until a
few minutes after so that on-the-hour jobs still happen.
-Andrew
On Tuesday morning I'm going to switch the OpenStack Keystone token
engine from UUID tokens to fernet tokens[0]. The changeover[1] will be
abrupt and cause all existing sessions to reset (e.g. if you're using
Horizon you'll have to log back in, and if you're in the middle of
creating a VM, that creation will probably fail). Should you encounter
this interruption, just try again in a few minutes and everything should
be fine.
The new tokens will generally behave the same from a user standpoint,
but will allow us to simplify and modernize things on our end a bit.
The switch will happen at 15:00 UTC on Tuesday, 2020-02-18. That's
7 AM Pacific time.
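For anyone who wants to double-check the time conversion, here is a minimal sketch. It assumes Pacific Standard Time (UTC-8), which applies in mid-February, and uses a fixed offset rather than a full time zone database; the `utc_to_pst` helper is just an illustrative name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Pacific Standard Time (UTC-8) applies, since US daylight saving
# time is not in effect in mid-February.
PST = timezone(timedelta(hours=-8), name="PST")


def utc_to_pst(moment: datetime) -> datetime:
    """Convert a timezone-aware datetime to Pacific Standard Time."""
    return moment.astimezone(PST)


switch_utc = datetime(2020, 2, 18, 15, 0, tzinfo=timezone.utc)
switch_pst = utc_to_pst(switch_utc)
print(switch_pst.strftime("%Y-%m-%d %H:%M %Z"))  # 2020-02-18 07:00 PST
```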
[0]
https://docs.openstack.org/keystone/pike/admin/identity-fernet-token-faq.ht…
[1] https://phabricator.wikimedia.org/T243418
In order to repair some ailing hardware, I'm going to migrate several
Cloud VPS instances later today. Each will be down for a few minutes
(or longer, depending on disk size) and then rebooted. Toolforge users
will be unaffected by this change.
Affected VMs will be:
vanadium
cloudstore-client-2
cloudstore-client-1
cn-staging-2
wikispore-test
wm-bot
traffic-cache-atstext-buster
etherpad
traffic-cache-atsupload
jbond-buster
commons-corruption-checker-main
codesearch6
canary1022-01
ores-web-06
cloud-cumin-01
petscan4
thanos-prom01
thanos-be01
extdist-04
On Monday we'll be restarting the database server that supports most
WMCS services. During the restart various things will fail: Wikitech
pages will fail to load, OpenStack API calls will fail, etc.
In all cases if you encounter an issue you can just count to 20 and try
again, by which time things will most likely be back up. Active VMs,
tools and other things hosted on Toolforge or Cloud VPS should be
unaffected.
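If you have scripts that talk to the affected services, the "count to 20 and try again" advice could be automated along these lines. This is a rough sketch, not an official client: the `retry` helper, its parameters, and the default values are assumptions chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(call: Callable[[], T], attempts: int = 3, delay_seconds: float = 20.0) -> T:
    """Run `call`, sleeping `delay_seconds` after each failure.

    The exception from the final attempt is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise
            # Give the restarting service time to come back before retrying.
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")
```

A caller would wrap whatever request it makes, for example `retry(lambda: fetch_page(url))`, where `fetch_page` stands in for your own function.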
The restart will happen at 15:00 UTC on Monday -- that's 7AM Pacific Time.