Hello,
Pretty much everyone who has dealt with creating views for new wikis on the
labs hosts has run into "Access denied" errors at some point.
This was usually due to a missing MariaDB grant. We tried to work around
this by adding the grant step to the maintain-views script.
Unfortunately, doing so led to some very strange problems; here is one
example: https://phabricator.wikimedia.org/T193187#4273281
After lots of back and forth we filed a bug with MariaDB (
https://jira.mariadb.org/browse/MDEV-16466), which was confirmed by MariaDB
yesterday and linked to a similar issue (
https://jira.mariadb.org/browse/MDEV-14732).
The fix is expected in 10.4 (we are on 10.1), so it is still quite a long
way off.
So, for now, the workaround before adding new views is to manually add the
GRANT on the database and then run the script:
GRANT SELECT, SHOW VIEW ON `newiki\_p`.* TO 'labsdbuser';
(Note the backslash: `_` is a wildcard in GRANT database names, so it is
escaped to match the literal underscore.)
Hopefully with this email everyone is on the same page now.
Thanks everyone (especially Brooke for helping me out with the
troubleshooting!)
Manuel.
Hi!
On 2019-06-03 14:00 UTC+2 (next Monday) we will be rebuilding the
cloudservices1003 server, which holds the designate service that serves DNS
requests for CloudVPS and Toolforge.
We have a backup server (cloudservices1004), so we don't expect much
downtime. However, DNS queries are frequent, and some of them may fail while
we stabilize the DNS service.
Please reach out to the WMCS team if you need more details or have any doubts.
regards.
--
Arturo Borrero Gonzalez
Operations Engineer / Wikimedia Cloud Services
Wikimedia Foundation
Hi,
There was an outage today, 2019-05-29, in CloudVPS/Toolforge involving keystone
and NFS. All CloudVPS projects (including Toolforge) had trouble using the
NFS-based storage due to an upgrade operation we were performing on
cloudcontrol1003.wikimedia.org.
You can read more about the incident here:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20190529-NFS-key…
The incident postmortem is not complete yet, but you can already read the main
sections:
* what happened
* timeline
* things to improve in the future
regards.
--
Arturo Borrero Gonzalez
Operations Engineer / Wikimedia Cloud Services
Wikimedia Foundation
This was my first trip to PyCon, and I can definitely say it is a strange bird as conferences go (surprisingly emotional). On the more standard conference side of it, besides a lot of hacking and people trying to sell things or hire people, these elements stood out:
Python 2:
Everyone is sort of dancing on Python 2’s grave in the Python community. There are stickers of its grave that were so popular I could only get ones that also have a company name on them. The transition is firmly established as a Good Thing, and it is now seen as a problem to have Python 2 in an environment (*glares at Debian*). This is probably well-known here, but it bears mentioning. Also: https://pythonclock.org/
Black:
The auto formatter black is catching on a lot. Django is considering a move to it. CircuitPython has an open issue to move everything in basically all libraries that isn’t C to it. It is becoming a fairly well-regarded way of generating low-diff code by way of simply not making formatting decisions (which is why it has no configuration except the CLI arg to change line length). I’ve been using it on things I touch where it makes sense.
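For a sense of the style, here is my own illustration (not taken from black's docs) of the kind of rewrite black performs: double quotes, normalized spacing, one canonical layout.

```python
# Before formatting (shown as a comment, since black would rewrite it):
#   def make_user( name,email,   role = 'user' ):
#       return { 'name':name,'email': email,'role':role }

# After running black, the same function comes out roughly like this:
# double quotes everywhere, single spaces, and no spaces around an
# unannotated keyword default.
def make_user(name, email, role="user"):
    return {"name": name, "email": email, "role": role}

print(make_user("Ada", "ada@example.org"))
```

Because there is one canonical output, diffs only ever show real changes, which is the "low-diff" property mentioned above.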
PEP554:
The effort to allow sub-interpreters and make threading so much more disastrously fun is moving right along. If you hate the GIL, you’ll either like this or hate it even more. It is expected to actually land somewhere around Python 3.8-3.9, which means it probably won’t show up in Buster. However, with the power of pyenv and similar things, actual concurrency in Python may be coming to a Toolforge or someone’s VPS project near you one day. Until then, it’s kind of cool to know it might be coming.
https://www.python.org/dev/peps/pep-0554/
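To see why people care, here is a minimal sketch (my own, not from the PEP) of the GIL behavior that motivates all this: CPU-bound threads in today's CPython don't actually run in parallel.

```python
import threading
import time

def count_down(n):
    # Pure-Python CPU-bound work: the GIL lets only one thread run
    # Python bytecode at a time, so extra threads can't speed this up.
    while n:
        n -= 1

N = 2_000_000

start = time.perf_counter()
count_down(N)
count_down(N)
serial = time.perf_counter() - start

start = time.perf_counter()
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

# Under the GIL, the threaded run is typically no faster than the serial
# one; isolated sub-interpreters are the door PEP 554 starts to open.
print(f"serial={serial:.3f}s threaded={threaded:.3f}s")
```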
pipenv:
Pipenv is still not “standard”, but it is picking up steam. If other efforts to package up OpenStack in debs, containers, etc. fail, the specificity of Pipfile.lock files may make pipenv a very good deployment alternative. They don’t usually seem to want to say it, but I will: it makes deploying Python as well-developed as deploying nodejs or rails ;-)
https://github.com/pypa/pipenv
There was a lot of other cool stuff going on, but much of it was not especially pertinent to WMCS. We won’t get the f-strings that everyone’s excited about until Buster (and can’t use them reliably until that’s the old-stable), and the walrus operator won’t actually end up in Debian until…the Future (https://www.python.org/dev/peps/pep-0572/). Apparently Python is also the primary language choice of dystopia (see also TensorFlow); nothing new, but really in-your-face at PyCon. It is also very clear that people would like to know how to deploy Python on Toolforge and our setup there. I was asked repeatedly to give a talk/demo, run an open space, or host a development sprint on our stuff (dev sprints are hard when you are on Gerrit, though; the tutorials for other folks are 100% GitHub/GitLab). I am interested in trying to do one or more of those next year if everything aligns.
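For reference, the two language features mentioned above look like this (f-strings need Python 3.6+; the walrus operator needs 3.8+):

```python
# f-strings (PEP 498, Python 3.6+): expressions interpolated directly
# inside string literals.
name, count = "Toolforge", 3
msg = f"{name} has {count} new tools ({count * 2} including forks)"

# The walrus operator (PEP 572, Python 3.8+): assign a name as part of
# an expression, e.g. inside a condition.
data = [1, 2, 3, 4, 5]
if (n := len(data)) > 3:
    summary = f"{n} items, first {data[0]}"

print(msg)      # Toolforge has 3 new tools (6 including forks)
print(summary)  # 5 items, first 1
```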
Brooke Storm
Operations Engineer
Wikimedia Cloud Services
bstorm@wikimedia.org
IRC: bstorm_
Hi!
On 2019-05-16 13:00 UTC there will be a maintenance operation in one of the
Wikimedia Foundation datacenter racks that affects 2 of our servers running
virtual machines [0]. There is a risk that this maintenance operation results
in power loss on those servers, affecting the virtual machines running on them.
However, there is no way to know for sure whether there will be any outage at all.
If you are an admin of any of the VMs in the list and you want your VM to be
reallocated to other servers prior to the operation, please get in touch
with us as soon as possible. Remember that, right now, reallocating a VM to
another server means briefly shutting it down.
Here is a list of affected virtual machines:
cloudvirt1028.eqiad.wmnet:
af-puppetdb01.automation-framework.eqiad.wmflabs
bastion-eqiad1-02.bastion.eqiad.wmflabs
fridolin.catgraph.eqiad.wmflabs
cloud-puppetmaster-02.cloudinfra.eqiad.wmflabs
cloudstore-dev-01.cloudstore.eqiad.wmflabs
commtech-nsfw.commtech.eqiad.wmflabs
clm-test-01.community-labs-monitoring.eqiad.wmflabs
cyberbot-exec-iabot-01.cyberbot.eqiad.wmflabs
deployment-db05.deployment-prep.eqiad.wmflabs
deployment-memc05.deployment-prep.eqiad.wmflabs
deployment-sca01.deployment-prep.eqiad.wmflabs
deployment-pdfrender02.deployment-prep.eqiad.wmflabs
ign.ign2commons.eqiad.wmflabs
integration-slave-docker-1050.integration.eqiad.wmflabs
integration-castor03.integration.eqiad.wmflabs
api.openocr.eqiad.wmflabs
osmit-umap.osmit.eqiad.wmflabs
builder-envoy.packaging.eqiad.wmflabs
jmm-buster.puppet.eqiad.wmflabs
a11y.reading-web-staging.eqiad.wmflabs
adhoc-utils01.security-tools.eqiad.wmflabs
util-abogott-stretch.testlabs.eqiad.wmflabs
canary1028-01.testlabs.eqiad.wmflabs
stretch.thumbor.eqiad.wmflabs
tools-worker-1023.tools.eqiad.wmflabs
tools-proxy-04.tools.eqiad.wmflabs
tools-docker-builder-06.tools.eqiad.wmflabs
tools-sgewebgrid-generic-0904.tools.eqiad.wmflabs
tools-sgeexec-0942.tools.eqiad.wmflabs
tools-sgeexec-0941.tools.eqiad.wmflabs
tools-sgeexec-0940.tools.eqiad.wmflabs
tools-sgeexec-0939.tools.eqiad.wmflabs
tools-sgeexec-0937.tools.eqiad.wmflabs
tools-sgeexec-0929.tools.eqiad.wmflabs
tools-sgeexec-0921.tools.eqiad.wmflabs
tools-sgeexec-0920.tools.eqiad.wmflabs
tools-sgeexec-0911.tools.eqiad.wmflabs
tools-sgeexec-0909.tools.eqiad.wmflabs
toolsbeta-proxy-01.toolsbeta.eqiad.wmflabs
vconverter-instance.videowiki.eqiad.wmflabs
perfbot.webperf.eqiad.wmflabs
wdhqs-1.wikidata-history-query-service.eqiad.wmflabs
cloudvirt1014.eqiad.wmnet:
commonsarchive-prod.commonsarchive.eqiad.wmflabs
deployment-imagescaler03.deployment-prep.eqiad.wmflabs
dumps-5.dumps.eqiad.wmflabs
dumps-4.dumps.eqiad.wmflabs
incubator-mw.incubator.eqiad.wmflabs
webperformance.integration.eqiad.wmflabs
saucelabs-01.integration.eqiad.wmflabs
integration-puppetmaster01.integration.eqiad.wmflabs
maps-puppetmaster.maps.eqiad.wmflabs
maps-wma.maps.eqiad.wmflabs
mwoffliner3.mwoffliner.eqiad.wmflabs
mwoffliner1.mwoffliner.eqiad.wmflabs
phlogiston-5.phlogiston.eqiad.wmflabs
discovery-testing-01.shiny-r.eqiad.wmflabs
snuggle-enwiki-01.snuggle.eqiad.wmflabs
canary-1014-01.testlabs.eqiad.wmflabs
tools-sgeexec-0901.tools.eqiad.wmflabs
wdqs-test.wikidata-query.eqiad.wmflabs
Toolforge won't be affected by this operation.
You can read more details about the datacenter operation itself in phabricator [1].
Sorry for the short notice,
regards.
[0] Cloud Services: reallocate workload from rack B5-eqiad
https://phabricator.wikimedia.org/T223148
[1] Install new PDUs into b5-eqiad https://phabricator.wikimedia.org/T223126
--
Arturo Borrero Gonzalez
Operations Engineer / Wikimedia Cloud Services
Wikimedia Foundation
As always, 70% of this conference is about building fresh, new clouds
rather than existing use-cases. That made for a very slow start on the
first day, but there were some interesting bits later on. Mark
Shuttleworth gave a brief talk where he re-affirmed Ubuntu's commitment
to supporting OpenStack and K8s in the long-term, and then scolded
attendees for getting distracted by (unspecified) shiny new things
rather than focusing on the fundamentals. I'm not really sure what that
was about but it was nice to hear someone assert that they still think
OpenStack is fundamental to the future of cloud tech.
The following is largely notes for my future self, but Brooke might be
interested in reading up about Rook.
Ceph/Rook:
Everyone is using ceph! Everyone also talks a lot about how hard it is
to deploy. There's a fair amount of buzz around 'Rook', which is a ceph
deployment/management system that we might want to consider. As I
understand it, you set up a k8s cluster with host networking on all of
your OSD nodes, and then Rook dumps a pod on each node which implements
the ceph services. Plenty of people are claiming that it works great,
and I think it supports rolling upgrades so that might be something to
consider instead of a bare puppet-and-debian-package deployment.
Deployment/package management:
There are lots of ways to deploy! Openstack on k8s, openstack on
openstack, openstack in containers pushed out by ansible, etc. etc.
Almost all of these assume that 1) you're starting from scratch and 2)
you want/have ironic control of bare metal. I spent a while thinking
that we should set up a k8s cluster and deploy openstack services
there... 'airship' might support that model (and it would line up with
using Rook to manage the ceph cluster) but I'm not sure that I'm not
just looking for a problem to solve when we don't really have one.
The one thing that might be useful for us is grabbing the kolla project
packages and deploying on simple standalone docker instances... that
would get us out of our current packaging hell. Assuming we don't ever
want to patch the projects, this might be a decent alternative to
deploying from source.
Designate:
The (two) designate developers are still alive and working on the
project. Development is very slow-paced right now, which is mostly good
for us because it means fewer headaches during upgrades :) Mugsie (the
PTL) switched jobs but says he still has someone paying him to work on
the project part-time, so there's no immediate danger of the project
dying off.
The Designate folks think that we should keep using designate-sink until
we're running version O. Then we can switch to the proper REST-based
neutron integration code for creating/deleting records on VM creation
and deletion. We'll want to write our own custom Neutron plugin to
replace the default one in order to replace the custom code that's
currently running in Sink.
The bad news is that the one feature I really want (the ability to share
.wmflabs.org between multiple tenants) is on the back-burner for the
moment. If money and staff dropped into our lap it might be nice for us
to get some contractor dollars devoted to someone working on that
(partly because I feel like we're a free-rider on the project and it
seems starved for resources).
Keystone:
The keystone upstream is finally implementing system-wide scope for
roles, which means that eventually we'll be able to give the 'observer'
users a system-wide scope rather than having to add it to every single
project. They're also in the process of standardizing on a true
project-admin policy which would let us get rid of some of our hacks
that allow project admins to add members to their own projects but not
others.
Of course, none of that is really useful until other projects have also
adopted these concepts, so we won't see any real gains until T or U.