I will be moving the toolforge grid master on Monday. That will mean
that for a few minutes it will be impossible to submit new grid jobs.
Jobs that are already running will be unaffected.
I'll make the move at 14:00 UTC, which is about 7:00 AM Pacific time.
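If you have scripts that submit jobs automatically, a small retry wrapper can ride out the brief submission outage. This is only a sketch: the function name and the five-minute delay are my own choices, and `jsub` is the usual Toolforge grid submission command.

```shell
# Sketch of a retry helper for use around 'jsub' while the grid
# master is briefly unavailable. The function name and the delay
# (overridable via RETRY_DELAY, default 300 seconds) are arbitrary.
retry_submit() {
    attempts=$1; shift
    i=0
    while [ "$i" -lt "$attempts" ]; do
        "$@" && return 0            # submission succeeded
        i=$((i + 1))
        sleep "${RETRY_DELAY:-300}" # wait before trying again
    done
    return 1                        # gave up after all attempts
}

# Usage (job name and script are placeholders):
# retry_submit 3 jsub -N my-job ./my-script.sh
```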
-Andrew
Hi there!
Unrelated to the other operations that were communicated recently (datacenter PDU
upgrades, operating system upgrades, etc.), we need to reboot all the cloudvirt
servers to apply security updates for CPU vulnerabilities.
Along with the physical hardware reboot we also need to reboot all the virtual
machines running in CloudVPS.
This operation is a bit disruptive but very quick, and should not lead to any
unexpected errors (it is just a reboot). We have already tested the same upgrades
on some other servers.
We will be doing the reboots during this week (starting 2019-07-29). If you see
any problems related to this, please contact us.
Regards,
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation
In the abuse_filter_log table, the afl_log_id field appears to never have been used and is being dropped upstream. To prepare for that and keep the replica view up and running, we are going to start removing the field from the abuse_filter_log views on the wiki replicas service, starting on Monday, July 26th.
This should have little to no impact on anyone, since the field appears to always be NULL everywhere and never to have been fully implemented; it would only affect an application that queries the field specifically.
For more information on the dropping of the column: https://phabricator.wikimedia.org/T226851
Some of the reasons why and a bit more context: https://phabricator.wikimedia.org/T214592
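One quick way to confirm a tool is unaffected is to search its source tree for the column name. This is just a sketch; `TOOL_DIR` is a placeholder you would point at your own code.

```shell
# A sketch: search a tool's source tree for references to the column
# being dropped. TOOL_DIR is a placeholder; point it at your own code.
TOOL_DIR="${TOOL_DIR:-$HOME/my-tool}"
grep -rn 'afl_log_id' "$TOOL_DIR" 2>/dev/null \
    || echo "no references to afl_log_id found"
```

No output from the `grep` (and the fallback message instead) means nothing in that tree queries the field.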
Brooke Storm
Operations Engineer
Wikimedia Cloud Services
bstorm@wikimedia.org
IRC: bstorm_
Hi there!
There is ongoing maintenance in the eqiad datacenter that involves changing the
power connectors of the servers. More info is in this Phabricator task: T226778 [0].
The PDU upgrade could potentially leave our hypervisors without power briefly.
For some hypervisors, we plan to take the risk of leaving them running. For
other hypervisors (those running important DBs in the form of virtual
machines), we will probably do a controlled shutdown before the operation to
ensure no data corruption happens in the databases.
The PDU upgrades will happen this week (see the Phabricator task [0]) and could
potentially affect every virtual machine we run in CloudVPS, including
Toolforge.
In the case of power loss, we expect the disruption to be very brief and not to
cause extended downtime in any case.
Please let us know about any issues you find related to this operation.
Regards,
[0] https://phabricator.wikimedia.org/T226778
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation
On Thursday, July 25th, 2019 between the hours of 1500 and 1700 UTC we will
be performing system maintenance on the NFS servers that support Toolforge
and the CloudVPS instances that are using NFS for home, project or scratch
data.
During this maintenance window, we will be applying rolling updates that
require NFS service restarts. We are taking precautions to minimize impact,
but there may be short periods of NFS service interruption or performance
degradation.
---
Jason Hedden
Site Reliability Engineer - Wikimedia Cloud Services
Wikimedia Foundation
As part of routine networking and OS upgrades, I'll be emptying two
hypervisors (cloudvirt1016 and cloudvirt1017) on Monday and Tuesday, the
22nd and 23rd. This will result in downtime for many VMs as they are
copied and restarted. A complete list of affected instances follows.
I'll begin by moving the deployment-prep project at around 13:00
UTC on Monday. After that copies will proceed in roughly the order you
see below, but the timing will be hard to predict.
Please let me know if you need to schedule a more specific window
for your downtime. Better yet, if any of the listed VMs are defunct and
can simply be deleted, please do that now and save me some time!
-Andrew + the WMCS team
Affected instances (shown as <project>: <instance name>):
account-creation-assistance: accounts-appserver4
account-creation-assistance: accounts-mwoauth
automation-framework: af-puppetdb02
butterfly: butterfly-m4m2
cloudinfra: cloudinfra-db02
codereview: Krypton
codereview: Radon
commtech: commtech-2
community-labs-monitoring: clm-web-01
community-labs-monitoring: clm-worker-01
dashiki: dashiki-01
dashiki: dashiki-staging-01
deployment-prep: deployment-cache-text05
deployment-prep: deployment-changeprop
deployment-prep: deployment-chromium01
deployment-prep: deployment-chromium02
deployment-prep: deployment-cpjobqueue
deployment-prep: deployment-dumps-puppetmaster02
deployment-prep: deployment-elastic06
deployment-prep: deployment-elastic07
deployment-prep: deployment-etcd-01
deployment-prep: deployment-eventlog05
deployment-prep: deployment-imagescaler01
deployment-prep: deployment-imagescaler02
deployment-prep: deployment-ircd
deployment-prep: deployment-jobrunner03
deployment-prep: deployment-kafka-jumbo-1
deployment-prep: deployment-logstash2
deployment-prep: deployment-mediawiki-07
deployment-prep: deployment-memc06
deployment-prep: deployment-memc07
deployment-prep: deployment-mwmaint01
deployment-prep: deployment-ores01
deployment-prep: deployment-puppetdb02
deployment-prep: deployment-puppetmaster03
deployment-prep: deployment-restbase01
deployment-prep: deployment-restbase02
deployment-prep: deployment-sentry01
deployment-prep: deployment-snapshot01
deployment-prep: deployment-urldownloader02
deployment-prep: deployment-zookeeper02
design: design-research-methods
dumps: dumps-0
dwl: dwl
dwl: taxonbota
fa-wp: tofawiki02
getstarted: gitservices
getstarted: webservices
glampipe: Glampipe
hound: hound-puppet-02
integration: integration-cumin
integration: integration-r-lang-01
integration: integration-slave-docker-1040
integration: integration-slave-docker-1041
integration: integration-slave-jessie-1002
k8splay: k8s-dzahn
lizenzhinweisgenerator: lizenzhinweisgenerator
maps: maps-tiles1
maps: maps-warper3
openrefine: openrefine01
openstack: cloud-bootstrapvz-stretch
otrs: otrs-oneclickspam-test
packagist-mirror: packagist-mirror1
partnermetrics: partnermetrics-redis-01
puppet-diffs: compiler1001
qna: meza-new2
quotatest: novaadminmadethis6
reading-web-staging: readers-web-master
recommendation-api: missing-sections
recommendation-api: rec-wiki
recommendation-api: related-articles
recommendation-api: tool
security-tools: logparse01
sentry: frama-test5
sentry: frama-test6-sb
services: kask
services: kask-client
shinken: shinken-02
shiny-r: discovery-production-02
testlabs: abogott-puppetmaster
testlabs: canary1016-01
tools: tools-sgecron-01
tools: tools-sgegrid-shadow
toolsbeta: toolsbeta-sgecron-01
toolsbeta: toolsbeta-sgegrid-shadow
toolsbeta: toolsbeta-sgewebgrid-lighttpd-0901
twl: wmil
video: encoding01
video: gfg01
video: video-redis
video: videodev
videowiki: app-instance
visualeditor: dumpgrepper
webperf: disposable
wikidata-dev: wikidata-constraints
wikidata-federation: federated-commons
wikidata-federation: federated-wikidata
wikidiff2-wmde-dev: wmde-wikidiff2-jacnth
wikidocumentaries: hupu
wikidocumentaries: roope
wikidumpparse: whgi
wikifactmine: elasticsearch-20
wikifactmine: elasticsearch-21
wikifactmine: puppetmaster-01
wikilabels: wikilabels-02
wikilabels: wikilabels-experiment
wikimetrics: wikimetrics-01
wikistream: ws-web
wikitextexp: wikitextexp-base-1002
wikitextexp: wikitextexp-expt-1002
wm-bot: wm-bot-pg
wm-bot: wm-bot2
wmf-research-tools: diegoTest
wmf-research-tools: wikilabels
wpx: wpx-redirects-01
On Friday I'll be moving the toolforge cron server to new hardware.
During the move, any uses of the 'crontab' command will fail
gracelessly. Any cron jobs scheduled to launch during the downtime will
be skipped.
The move should take 5-10 minutes but may take as long as 30 if there
are complications.
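Before the move, it may be worth keeping a copy of your crontab so it can be compared against (or restored) afterwards. A minimal sketch; the backup filename is an example, not a convention.

```shell
# A sketch: save a dated copy of your crontab ahead of the cron server
# move. The backup path is an example; adjust to taste.
BACKUP="$HOME/crontab-backup-$(date +%F)"
crontab -l > "$BACKUP" 2>/dev/null || echo "no crontab entries saved"
```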
-Andrew