Hello,
Pretty much everyone who's dealt with creating views for new wikis on the
labs hosts has run into "Access denied" errors at some point.
This was usually due to the MariaDB grant for the role being missing. We
tried to work around this by including the grant addition in the
maintain-views script.
Unfortunately, we ran into some very strange problems when doing so; here
is an example: https://phabricator.wikimedia.org/T193187#4273281
After lots of back and forth we decided to file a bug with MariaDB (
https://jira.mariadb.org/browse/MDEV-16466), which was confirmed by MariaDB
yesterday and linked to a similar issue (
https://jira.mariadb.org/browse/MDEV-14732).
The expected fix will come in 10.4 (we are on 10.1), so it is still quite
a long way ahead of us.
So, for now, the workaround before adding new views is to manually add the
GRANT on the DB and then run the script:
GRANT SELECT, SHOW VIEW ON `newiki\_p`.* TO 'labsdbuser';
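In practice the sequence looks roughly like this (a sketch only: the
mysql invocation and the maintain-views flags here are illustrative and
may differ on the actual labsdb hosts):

  # Step 1: add the grant by hand on the replica host:
  $ sudo mysql -e "GRANT SELECT, SHOW VIEW ON \`newiki\_p\`.* TO 'labsdbuser';"

  # Step 2: then run the views script for the new wiki as usual:
  $ sudo maintain-views --databases newiki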
Hopefully with this email everyone is on the same page now.
Thanks everyone (especially Brooke for helping me out with the
troubleshooting!)
Manuel.
Hi,
I'm opening this email thread to try to get an overview of IPv6 in CloudVPS.
Several times I've heard that OpenStack Mitaka isn't the most
appropriate version to start doing IPv6 with.
However, I've read the config docs [0] and I didn't detect any major
issues at first glance.
@Chase, do you remember the issues you found?
Sooner or later, we will have to handle IPv6 in CloudVPS (and in
Toolforge), so a plan could be:
* design the ideal IPv6 model [1], with agreement from the main SRE team
* start doing tests in the labtestn deployment in codfw
* evaluate how this could be added to eqiad1 incrementally
Even if we must forget about IPv6 in Mitaka, we could start thinking
about the ideal model now.
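For the labtestn tests, the Mitaka networking guide [0] points at
SLAAC-based subnets, so a first experiment could look roughly like this
(a sketch only; the network name and prefix below are made up):

  # Add an IPv6 subnet to an existing tenant network, with router
  # advertisements and addressing both handled via SLAAC:
  $ neutron subnet-create --name test-ipv6 --ip-version 6 \
      --ipv6-ra-mode slaac --ipv6-address-mode slaac \
      lab-test-net 2001:db8:1::/64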
[0] https://docs.openstack.org/mitaka/networking-guide/config-ipv6.html
[1]
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron_ideal_mo…
I can see this is still riding high, and at least some of it is likely
because of file operations I have going on there. I had hoped they would
be done today, but they're not as far along as I would like. It's
difficult to estimate, as there is a tar+scp happening and I'm not sure
how much benefit the compression is providing, but it's certainly
slowing things down. Let me know if this is an issue; I'm keeping an
eye on it and hoping it's done today.
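For context, "tar+scp" here means the usual two-step pattern below; the
paths and destination host are made up, but the -z flag is the
compression knob in question:

  # Compressing while archiving: fewer bytes to copy, lots of CPU:
  $ tar -czf /srv/backup.tar.gz /srv/bigdir
  $ scp /srv/backup.tar.gz dest.example.org:/srv/

  # Dropping -z trades more bytes on the wire for much less CPU:
  $ tar -cf /srv/backup.tar /srv/bigdir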
On Wed, Dec 12, 2018 at 6:30 AM <nagios@icinga1001.wikimedia.org> wrote:
> Notification Type: PROBLEM
>
> Service: High load average
> Host: labstore1007
> Address: 208.80.155.106
> State: CRITICAL
>
> Date/Time: Wed Dec 12 12:30:38 UTC 2018
>
> Notes URLs: https://grafana.wikimedia.org/dashboard/db/labs-monitoring
>
> Additional Info:
>
> CRITICAL: 100.00% of data above the critical threshold [24.0]
>
--
Chase Pettet
chasemp on phabricator <https://phabricator.wikimedia.org/p/chasemp/> and
IRC
Hello,
Like any large project, we have a lot of old tasks/tickets/issues open,
and it's hard to know which are still relevant, considering that a lot
has changed, nobody has been available to work on them, etc.
To keep things under control, some open source projects have adopted a
time limit on how long these can stay open. I would like to propose we
do the same.
A soft threshold would trigger a warning being added to the task,
saying it's going stale. After a hard threshold is reached, the task
would be closed.
This doesn't mean someone can't re-open the task, and we could also
have a special tag preventing the stale mechanism from activating on it.
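To make the idea concrete, finding the candidate tasks could be done
through Phabricator's Conduit API, roughly like the sketch below (the
one-year threshold is just a placeholder, the exemption tag would need
an extra constraint, and warning/closing would then go through
maniphest.edit with "comment" or "status" transactions):

  # List open tasks not modified in the last year:
  $ curl -s https://phabricator.wikimedia.org/api/maniphest.search \
      -d api.token="$CONDUIT_TOKEN" \
      -d 'constraints[statuses][0]=open' \
      -d "constraints[modifiedEnd]=$(date -d '1 year ago' +%s)"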
Any thoughts?
Thank you
--
Giovanni Tirloni
Operations Engineer
Wikimedia Cloud Services
Hi,
I just finished converting the remaining tools to use cron and removed
BigBrother from Toolforge. The docs have been updated to reflect this.
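For reference, the cron entries follow the usual Toolforge pattern,
something like this illustrative example (the tool path, job name and
schedule are made up; -once keeps only a single copy of the job
running):

  # Resubmit the grid job every 10 minutes if it isn't running:
  */10 * * * * /usr/bin/jsub -once -quiet -N mytool-job /data/project/mytool/bin/run.sh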
If you notice any issues, please let me know.
Regards,
--
Giovanni Tirloni
Operations Engineer
Wikimedia Cloud Services
On 12/6/18 9:16 PM, Andrew Bogott wrote:
> I recently noticed that some of our standard kvm/nova monitoring never
> got copied over from the labvirt puppet code to the cloudvirt puppet
> code. Tomorrow I will merge
> https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/478113/ to fix that.
>
> Once that patch is merged, icinga will be a bit touchier on the
> cloudvirts. In particular, it will alert for any cloudvirt that has 0
> VMs running on it. (This turns out to be a useful thing to watch for
> because we've had cases where every single kvm process died at once.)
>
> So, all 'idle' cloudvirts should nonetheless have a canary instance. For
> example, on the new analytics cloudvirts I created canaries like this:
>
> $ OS_PROJECT_ID=testlabs openstack server create --image
> 7c6371d1-8411-48c7-bf73-2ef6d6ff2a15 --flavor m1.small --nic
> net-id=7425e328-560c-4f00-8e99-706f3fb90bb4 --availability-zone
> host:cloudvirtan1004 canary-an1004-01
>
> Once a virt host is in full service we can leave the canaries there or
> delete them -- there hasn't been any real consistent policy about that.
Thanks for the heads up and the example command.
I think it makes sense to have a canary per cloudvirt. It does mean more
OS instances that need to be updated and perhaps excluded from metrics
collection, but the annoyance should be minimal. It would be good to
have a barebones OS image for them, but I'd consider that a very low
priority.
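As an aside, the "0 VMs" condition should be cheap to probe on the host
itself. Conceptually it's something like the sketch below, though the
actual check in Andrew's patch may well work differently (e.g. asking
the nova API rather than counting processes):

  # Nagios-style plugin sketch: CRITICAL if no qemu/kvm processes
  # are running on this cloudvirt, OK otherwise.
  count=$(pgrep -c -f qemu) || count=0
  if [ "$count" -eq 0 ]; then
      echo "CRITICAL: no VMs running"
      exit 2
  fi
  echo "OK: $count VMs running"
  exit 0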
--
Giovanni Tirloni
Operations Engineer
Wikimedia Cloud Services