Hello,
Pretty much everyone who has dealt with creating views for new wikis on the
labs hosts has occasionally run into "Access denied" errors.
This was usually due to the MariaDB grant for the role being missing. We tried
to work around this by including the grant addition in the maintain-views
script.
Unfortunately, we ran into very weird problems when doing so; here is an
example: https://phabricator.wikimedia.org/T193187#4273281
After lots of back and forth we decided to file a bug with MariaDB (
https://jira.mariadb.org/browse/MDEV-16466), which MariaDB confirmed
yesterday, pointing to a similar issue (
https://jira.mariadb.org/browse/MDEV-14732).
The fix is expected to arrive in 10.4 (we are on 10.1), so it is quite a long
way ahead of us.
So, for now, the workaround before adding new views is to manually add the
GRANT on the DB and then run the script:
GRANT SELECT, SHOW VIEW ON `newiki\_p`.* TO 'labsdbuser';
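For clarity, the manual workaround can be sketched as a small shell snippet.
The wiki name, the mysql invocation and the maintain-views flags below are
illustrative, not authoritative:

```shell
#!/bin/sh
# Sketch of the manual workaround: add the grant first, then run the
# maintain-views script. The wiki name below is just an example.
wiki="newiki"

# Build the GRANT statement; the escaped underscore keeps it a literal
# match rather than a single-character wildcard in the db-name pattern.
grant_sql="GRANT SELECT, SHOW VIEW ON \`${wiki}\\_p\`.* TO 'labsdbuser';"
printf '%s\n' "$grant_sql"

# On the replica host one would then run something like (command and
# flags are illustrative):
#   mysql -e "$grant_sql"
#   sudo maintain-views --databases "${wiki}" --debug
```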
Hopefully with this email everyone is on the same page now.
Thanks everyone (especially Brooke for helping me out with the
troubleshooting!)
Manuel.
Hi,
If we poke holes in the firewall so Icinga can reach the VMs and we
define the monitoring::service stuff in Puppet, is that all we need to
shut down Shinken? Do you think there would be any concerns with going
that route?
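Since network access to Cloud VPS instances is controlled by OpenStack
security groups, the firewall side might look roughly like the following.
The security group name, the Icinga source address, and the port are
placeholders (5666 is simply the standard NRPE port), not the real values:

```shell
# Hypothetical example: allow the Icinga server to reach the monitoring
# agent port on VMs in the "default" security group. The remote IP and
# port are placeholders for illustration only.
openstack security group rule create default \
    --protocol tcp \
    --dst-port 5666 \
    --remote-ip 192.0.2.10/32
```

The monitoring::service definitions themselves would then live in Puppet
alongside the production checks.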
I'm asking about this because soon we'll see requests about removing
more and more Jessie support from the Puppet codebase [0] and there's no
exit strategy from Jessie for Shinken (it's not available in Stretch
unless we want to package things ourselves, which I tried and failed to do).
0 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/491460
Thanks,
--
Giovanni Tirloni
Operations Engineer
Wikimedia Foundation
Hi,
Here is a brief update on the status of Toolforge and CloudVPS as of today,
2019-02-16, along with some rough estimates of what to expect in the following
days. Keeping track of all the events we had this week may be complex, because
there were several of them, heavily intermixed.
* CloudVPS suffered severe hardware issues this week [0]. We solved most of the
problems and added spare hardware [1] because our server capacity was
significantly reduced. The service should be mostly stable right now.
* Toolsdb (tools.db.svc.eqiad.wmflabs) is currently overloaded and suffering
from hardware errors. We are already working on a replacement for this service
[2]. Services depending on this database aren't working properly (like PAWS) and
Toolforge tools that use it are also affected.
An honest estimate is that services (especially Toolsdb) won't be fully
recovered until at least next Tuesday (2019-02-26).
Our current plans involve replacing the Toolsdb hardware with virtual machines
inside CloudVPS [3]. We are trying to be extra cautious to prevent data loss and
other problems usually associated with doing things in a rush.
Finally, I would like to mention that we are all well aware of the importance of
these services for the community and we are doing our best to get things fixed.
Thanks for your understanding and patience.
Regards,
[0] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190213-cloudvps
[1] CloudVPS: drain and rebuild labvirt1009 as cloudvirt1009
https://phabricator.wikimedia.org/T216239
[2] ToolsDB overload and cleanup https://phabricator.wikimedia.org/T216208
[3] Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020
https://phabricator.wikimedia.org/T193264
--
Arturo Borrero Gonzalez
Operations Engineer / Wikimedia Cloud Services
Wikimedia Foundation
Hi,
Due to a HW issue with the disks [0], cloudvirt1018 is currently shut down and
not responding in any capacity.
All VM instances that were running in this hypervisor are totally unreachable.
The count of affected VM instances [1] is 64.
[0] Degraded RAID on cloudvirt1018 https://phabricator.wikimedia.org/T216004
[1] cloudvps: evaluate draining cloudvirt1018
https://phabricator.wikimedia.org/T216030
--
Arturo Borrero Gonzalez
Operations Engineer / Wikimedia Cloud Services
Wikimedia Foundation
I have created a script to send out nag emails about moving things off
of the Trusty job grid. I wanted to share an example of its output with
y'all before I run the script for the first time, to get any feedback
you may have about better wording or other changes. To keep folks who
are running a lot of tools from being bombarded by messages, the script
will send a single email per Toolforge maintainer listing all of the
tools that the particular maintainer has access to that need to be
migrated.
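The per-maintainer grouping described above can be sketched in a few lines of
shell. The input format and the maintainer/tool pairs below are made up for
illustration; the real script presumably pulls this data from the grid
accounting logs and LDAP:

```shell
#!/bin/sh
# Sketch: collapse (maintainer, tool) pairs into one line per maintainer,
# so each person gets a single email listing all of their affected tools.
cat > pairs.txt <<'EOF'
bd808 jouncebot
bd808 stewardbots
example-user gridengine-status
EOF

result=$(awk '{
    if ($1 in tools) tools[$1] = tools[$1] ", " $2
    else tools[$1] = $2
} END {
    for (m in tools) print m ": " tools[m]
}' pairs.txt | sort)

printf '%s\n' "$result"
```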
Bryan
---------- Forwarded message ---------
From: Toolforge admins <tools.admin(a)tools.wmflabs.org>
Date: Wed, Feb 6, 2019 at 10:28 PM
Subject: [Toolforge] Tools you maintain are running on Trusty job grid
To: <bd808(a)tools.wmflabs.org>
Hello bd808,
This email is a reminder that the tools listed below have run jobs and/or
webservices using the Ubuntu Trusty job grid in the past 7 days. This job grid
will be shut down on or before the week of 2019-03-25 as the final step in the
removal of Ubuntu Trusty from Toolforge and the larger Cloud VPS environment.
* convert
* gridengine-status
* irc-wmt
* jouncebot
* meetbot
* my-first-flask-tool
* mysql-php-session-test
* stewardbots
* sulinfo
See <https://tools.wmflabs.org/trusty-tools/u/bd808> for more details
on these tools and the jobs that have been seen.
Please see the migration instructions on Wikitech [0] for more information on
how to move your tools to either the new Debian Stretch job grid or the
Kubernetes cluster.
[0]: https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation
Thanks,
The Toolforge admin team
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA
irc: bd808 v:415.839.6885 x6855
Hi,
These emails are causing alert fatigue.
We've tweaked the thresholds high enough to make them rare, but they still
occur and we never take any action (in part because there's nothing
feasible to be done until we change our storage situation and/or most
workloads are migrated to Kubernetes, where we could implement better
controls).
I'd like to propose we disable these alerts for the time being and
re-evaluate our service level indicators when appropriate.
Giovanni Tirloni
Operations Engineer
Wikimedia Cloud Services
On Mon, Feb 4, 2019, 01:47 shinken <shinken(a)shinken-02.shinken.eqiad.wmflabs>
wrote:
> Notification Type: RECOVERY
>
> Service: High iowait
> Host: tools-exec-1419
> Address: 10.68.23.223
> State: OK
>
> Date/Time: Mon 04 Feb 03:46:59 UTC 2019
>
> Notes URLs:
>
> Additional Info:
>
> OK: All targets OK
>