- Cloud-announce - lists.wikimedia.org

Grid outage tomorrow, 16:00 UTC
by Andrew Bogott 21 Dec '18

Tomorrow I'll be moving the grid engine master node to a new virt host. That will cause a 15-minute outage during which new jobs (crons, or things submitted by hand) will fail. Existing jobs or webservices will be unaffected by the downtime. I'll start the move at 16:00 UTC on Friday, 2018-12-21. That's 8AM in California. -Andrew

CloudVPS network maintenance tomorrow 2018-12-20 @ 17:00 UTC
by Arturo Borrero Gonzalez 20 Dec '18

Hi!

Tomorrow, 2018-12-20 @ 17:00 UTC (~24h from now), we will be conducting some network maintenance in Cloud VPS (OpenStack). We will be doing some work on the transport network that connects the Neutron server to the rest of the internet.

Running CloudVPS instances will see a brief connection problem if connected to any external service (outside CloudVPS). If everything goes well, as our tests suggest it should, all operations will be finished in just a couple of minutes.

Let us know about any issues you find. Thanks.

BigBrother has been disabled
by Giovanni Tirloni 11 Dec '18

Hello,

Today we have disabled BigBrother in Toolforge.

BigBrother was a tool that monitored continuous jobs that failed to get restarted because they ran into corner cases where Grid Engine wasn't sufficiently smart to restart them (e.g. out of memory). BigBrother would continuously monitor those jobs and duplicate that functionality on a layer above Grid Engine.

Although very few tools used BigBrother (0.65% to be more precise), it taxed our NFS file server constantly, so keeping it around didn't make much sense. Additionally, its functionality could be easily implemented with a shell script running from cron.

So we've converted all tools that had a .bigbrotherrc file to use a bigbrother.sh script that is triggered every 5 minutes to restart jobs. If your tool used BigBrother, please check your crontab (`crontab -l`) and you will see a few entries like this:

```
# Ensure continuous jobs are running
*/5 * * * * jlocal /data/project/tool_name/bigbrother.sh job_name job_script
```

Documentation has also been updated to reflect this change:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Bigbrother_(Depreca…

In our tests everything worked fine, but please let us know if your tool is being impacted by this change.

Regards,
--
Giovanni Tirloni
Operations Engineer
Wikimedia Cloud Services
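
For readers wondering what a cron-driven watchdog of this kind might look like, here is a minimal sketch. It is not the actual bigbrother.sh deployed by WMCS. The function name `ensure_running` is hypothetical, and the defaults (`qstat -j` to check for a job by name, `jstart -N` to resubmit it) are assumptions about the grid tooling that can be overridden through the `CHECK_CMD` and `START_CMD` variables.

```shell
#!/bin/sh
# Hypothetical watchdog sketch; NOT the real bigbrother.sh.
# Assumed defaults: `qstat -j NAME` succeeds only while the job exists,
# and `jstart -N NAME SCRIPT` submits a continuous job. Override as needed.
: "${CHECK_CMD:=qstat -j}"
: "${START_CMD:=jstart -N}"

ensure_running() {
    job_name=$1
    job_script=$2
    # If the scheduler still knows about the job, leave it alone.
    if $CHECK_CMD "$job_name" >/dev/null 2>&1; then
        echo "running: $job_name"
        return 0
    fi
    # Otherwise resubmit it; the function returns the start command's status.
    echo "restarting: $job_name"
    $START_CMD "$job_name" "$job_script"
}
```

A script that ends with `ensure_running "$1" "$2"` could then be invoked with the same `job_name job_script` arguments as the crontab entry quoted in the message.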

Dumps NFS maintenance - 2018-12-03 @ 1700 UTC and 2018-12-07 @ 1700 UTC
by Brooke Storm 07 Dec '18

On Monday, December 3rd, 2018 at 1700 UTC, we will be rebooting one of the two dumps NFS servers (labstore1006.wikimedia.org). This should briefly cause rising load issues, but it should be quick enough that failing over services is unlikely to be helpful.

We will be failing over the web service before that time and failing it back before rebooting the partner server (labstore1007.wikimedia.org) on Friday, December 7th at 1700 UTC. This should not interrupt service for dumps.wikimedia.org (the site hosted on these systems), since that should be failed over to the non-rebooting partner.

Brooke Storm
Operations Engineer
Wikimedia Cloud Services
bstorm(a)wikimedia.org
IRC: bstorm_

additional monitoring on cloudvirts -- don't run them empty!
by Andrew Bogott 07 Dec '18

I recently noticed that some of our standard kvm/nova monitoring never got copied over from the labvirt puppet code to the cloudvirt puppet code. Tomorrow I will merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/478113/ to fix that.

Once that patch is merged, icinga will be a bit touchier on the cloudvirts. In particular, it will alert for any cloudvirt that has 0 VMs running on it. (This turns out to be a useful thing to watch for because we've had cases where every single kvm process died at once.)

So, all 'idle' cloudvirts should nonetheless have a canary instance. For example, on the new analytics cloudvirts I created canaries like this:

$ OS_PROJECT_ID=testlabs openstack server create --image 7c6371d1-8411-48c7-bf73-2ef6d6ff2a15 --flavor m1.small --nic net-id=7425e328-560c-4f00-8e99-706f3fb90bb4 --availability-zone host:cloudvirtan1004 canary-an1004-01

Once a virt host is in full service we can leave the canaries there or delete them -- there hasn't been any real consistent policy there.

In related news, I'm attempting to silence cloudvirt1019 and 1020 altogether with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/478115/ because we reboot them twice a day and a reboot always kills any running VMs.

-Andrew
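
If several idle hosts need canaries, the command in the message could be generated per host. The sketch below is a dry run only: it prints commands and executes nothing. The image, flavor, and network IDs are copied from the message; the helper name `canary_cmd` and the rule of stripping the `cloudvirt` prefix to build the instance name are assumptions inferred from the single `canary-an1004-01` example.

```shell
#!/bin/sh
# Dry-run sketch: prints one canary-creation command per host, runs nothing.
# IDs are copied from the message; canary_cmd is a hypothetical helper.
IMAGE=7c6371d1-8411-48c7-bf73-2ef6d6ff2a15
FLAVOR=m1.small
NET_ID=7425e328-560c-4f00-8e99-706f3fb90bb4

canary_cmd() {
    host=$1
    # Assumed naming rule: cloudvirtan1004 -> canary-an1004-01
    short=${host#cloudvirt}
    printf 'OS_PROJECT_ID=testlabs openstack server create --image %s --flavor %s --nic net-id=%s --availability-zone host:%s canary-%s-01\n' \
        "$IMAGE" "$FLAVOR" "$NET_ID" "$host" "$short"
}

# Print a command for every host name passed on the command line.
for h in "$@"; do
    canary_cmd "$h"
done
```

Reviewing the printed commands before pasting them into a shell keeps the step manual, which seems prudent given that the IDs above may change over time.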

Neutron region migrations, round five
by Andrew Bogott 04 Dec '18

With any luck we'll have some more hardware installed by next week, so it's time to move more projects! This is probably the last round of bulk moves; after this it's all special cases for which I'll contact people directly.

Tuesday, 2018-12-11: maps, wm-bot
Wednesday, 2018-12-12: mwoffliner, wildcat
Thursday, 2018-12-13: snuggle, services, commonsarchive, wikitextexp
Friday, 2018-12-14: queryrapi, wikidumpparse, wikistats, butterfly
Monday, 2018-12-17: huggle, incubator, iiab, openrefine, wcdo, wikidataconcepts
Tuesday, 2018-12-18: wikimetrics, newsletter, telnet, signwriting, ogvjs-ingetration
Wednesday, 2018-12-19: multimedia, orig, security-tools, phragile, wikistream, otrs, yandex-proxy
Thursday, 2018-12-20: dashiki, etytree, partnermetrics, graphql

Some context for what this is all about can be found here: https://phabricator.wikimedia.org/phame/post/view/120/neutron_is_here/

Please let me know if you are involved in one of those projects and need to postpone the move, or schedule a to-the-minute migration window.

- Andrew + the WMCS team

ToolsDB Maintenance - 2018-11-27 @ 1730 UTC
by Brooke Storm 27 Nov '18

ToolsDB will be undergoing maintenance and updates on Tuesday, November 27th, from 1730 UTC to 1800 UTC. Actual outage times should be fairly brief, but during this time the database will be taken offline and the system rebooted.

Because the outage is expected to be brief and some tables are not replicated (see https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#ToolsDB_Backups…), we are not planning on failing over to the replica at this time.

Brooke Storm
Operations Engineer
Wikimedia Cloud Services
bstorm(a)wikimedia.org
IRC: bstorm_

CloudVPS network maintenance 2018-11-27 @ 17:30 UTC
by Arturo Borrero Gonzalez 27 Nov '18

Hi,

Next Tuesday, 2018-11-27 @ 17:30 UTC, we will reboot the labnet1001.eqiad.wmnet server for maintenance and security updates. This server provides virtual networking services for CloudVPS in the main deployment (the old one, different from the eqiad1 deployment).

We won't be doing any failover prior to the reboot for operational reasons (we measured that the failover downtime is longer than the actual reboot time).

The impact of this brief reboot downtime will be:

* all VMs in the main CloudVPS deployment won't have network connectivity
* ongoing network connections (downloads, uploads) will fail and will have to be restarted
* cross connectivity between VM instances in the main and eqiad1 deployments won't be possible

Thanks for your understanding, and let us know about any issues you find after the reboot next week.

OSM database reboot next Tuesday 2018-11-20 at 17:30 UTC
by Arturo Borrero Gonzalez 20 Nov '18

Hi,

Next Tuesday, 2018-11-20 at 17:30 UTC, we will be rebooting the OSM database (part of our data services) for maintenance and security updates. Specifically, the labstore1006.eqiad.wmnet (osmdb.eqiad.wmnet) server will be rebooted. The other server in the cluster, labstore1007.eqiad.wmnet, has been rebooted already, but we won't be doing any pre-failover for operational reasons.

Apologies in advance for any inconvenience, and please let us know about any issues you find after these operations.

tools-bastion-02 aka tools-dev downtime on Tuesday
by Andrew Bogott 20 Nov '18

Hello! I need to shut down the tools-dev host in order to move it to a different server. The downtime will be brief, but in the meantime I recommend people move their work to a different bastion (e.g. tools-login.wmflabs.org) in order to avoid interruption. This will happen on or near 15:00 UTC on Tuesday, 2018-11-20. I'll also send alerts to sessions on the bastion prior to the shutdown. -Andrew
