Tomorrow I'll be moving the grid engine master node to a new virt host.
That will cause a 15-minute outage during which new job submissions
(from cron or by hand) will fail.
Existing jobs and webservices will be unaffected by the downtime.
I'll start the move at 16:00 UTC on Friday, 2018-12-21. That's 8 AM US
Pacific time.
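
If you have something scheduled to submit a job during that window, one
workaround is to wrap the submission in a short retry loop. A minimal
sketch, where the job name and script path are placeholders for your
tool's own:

#!/bin/bash
# Retry a grid submission a few times so a brief master outage
# doesn't drop the job. Job name and script are placeholders.
for attempt in 1 2 3; do
    jsub -N my_job /data/project/my_tool/my_job.sh && exit 0
    echo "submit attempt ${attempt} failed; retrying in 5 minutes" >&2
    sleep 300
done
echo "giving up after 3 attempts" >&2
exit 1
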
Tomorrow, 2018-12-20 at 17:00 UTC (~24h from now), we will be conducting
some network maintenance in Cloud VPS (openstack).
We will be doing some work on the transport network that connects the
Neutron server to the rest of the internet. Running Cloud VPS instances
will see a brief interruption in connections to external services.
According to our tests, if everything goes as planned, all operations
will be finished in just a couple of minutes.
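
If you want to see the blip from inside an instance, a quick probe loop
like this will show it (assuming curl is available; the target URL is
just an example):

# Probe an external service once a second and log failures.
while true; do
    if curl -s -o /dev/null --max-time 2 https://www.wikipedia.org/; then
        echo "$(date -u +%T) ok"
    else
        echo "$(date -u +%T) FAILED"
    fi
    sleep 1
done
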
Please let us know about any issues you find. Thanks.

Today we have disabled BigBrother in Toolforge. BigBrother was a tool
that watched continuous jobs which failed to get restarted because they
ran into corner cases where Grid Engine wasn't sufficiently smart to
restart them (e.g. after running out of memory). BigBrother continuously
monitored those jobs, duplicating that restart functionality at a layer
above the grid.
Although very few tools used BigBrother (0.65%, to be precise), it
taxed our NFS file server constantly, so keeping it around didn't make
much sense. Additionally, its functionality can be easily implemented
with a shell script running from cron.
So we've converted all tools that had a .bigbrotherrc file to use a
bigbrother.sh script that is triggered every 5 minutes to restart jobs.
If your tool used BigBrother, check your crontab (`crontab -l`) and you
will see one or more entries like this:
# Ensure continuous jobs are running
*/5 * * * * jlocal /data/project/tool_name/bigbrother.sh job_name job_script
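
The generated script is per-tool, but the idea is simple: check whether
each named job is still known to the grid, and resubmit it if not. A
rough sketch of that logic (not the exact generated script; qstat and
the jstart wrapper are the usual grid tools):

#!/bin/bash
# Sketch of the idea behind bigbrother.sh (not the exact generated
# script). Usage: bigbrother.sh job_name job_script
job_name="$1"
job_script="$2"
# If no job with this name shows up in qstat output, resubmit it.
# (Note: default qstat output truncates long job names.)
if ! qstat 2>/dev/null | grep -qF "$job_name"; then
    echo "$(date -u +%FT%TZ) restarting ${job_name}" >> bigbrother.log
    jstart -N "$job_name" "$job_script"
fi
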
Documentation has also been updated to reflect this change:
In our tests everything worked fine, but please let us know if your
tool is impacted by this change.
Wikimedia Cloud Services

I recently noticed that some of our standard kvm/nova monitoring never
got copied over from the labvirt puppet code to the cloudvirt puppet
code. Tomorrow I will merge
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/478113/ to fix that.
Once that patch is merged, icinga will be a bit touchier on the
cloudvirts. In particular, it will alert for any cloudvirt that has 0
VMs running on it. (This turns out to be a useful thing to watch for
because we've had cases where every single kvm process died at once.)
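
Conceptually, the new check just counts VM processes on the host and
goes critical at zero. A minimal sketch of that logic (not the actual
plugin from the patch):

#!/bin/bash
# Sketch of an NRPE-style check: critical when a cloudvirt is
# running zero VM (qemu/kvm) processes.
count=$(pgrep -c -f qemu-system)
if [ "${count:-0}" -eq 0 ]; then
    echo "CRITICAL: no VM processes running on $(hostname)"
    exit 2
fi
echo "OK: ${count} VM processes running"
exit 0
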
So, all 'idle' cloudvirts should nonetheless have a canary instance.
For example, on the new analytics cloudvirts I created canaries like this:
$ OS_PROJECT_ID=testlabs openstack server create --image
7c6371d1-8411-48c7-bf73-2ef6d6ff2a15 --flavor m1.small --nic
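
For reference, a complete invocation has roughly this shape (the net-id
value and the server name are placeholders to adjust per deployment):

$ OS_PROJECT_ID=testlabs openstack server create \
      --image 7c6371d1-8411-48c7-bf73-2ef6d6ff2a15 \
      --flavor m1.small \
      --nic net-id=<instance-network-id> \
      canary1019-01
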
Once a virt host is in full service we can leave the canaries there or
delete them -- there hasn't been any consistent policy about that so far.
In related news, I'm attempting to silence alerts for cloudvirt1019 and
1020: we reboot them twice a day, and a reboot always kills any running
VMs.

With any luck we'll have some more hardware installed by next week, so
it's time to move more projects! This is probably the last round of
bulk moves; after this it's all special cases, for which I'll contact
the affected projects directly.
Tuesday, 2018-12-11: maps, wm-bot
Wednesday, 2018-12-12: mwoffliner, wildcat
Thursday, 2018-12-13: snuggle, services, commonsarchive, wikitextexp
Friday, 2018-12-14: queryrapi, wikidumpparse, wikistats, butterfly
Monday, 2018-12-17: huggle, incubator, iiab, openrefine, wcdo,
Tuesday, 2018-12-18: wikimetrics, newsletter, telnet, signwriting,
Wednesday, 2018-12-19: multimedia, orig, security-tools, phragile,
wikistream, otrs, yandex-proxy
Thursday, 2018-12-20: dashiki, etytree, partnermetrics, graphql
Some context for what this is all about can be found here:
Please let me know if you are involved in one of those projects and need
to postpone the move, or schedule a to-the-minute migration window.
- Andrew + the WMCS team