Hi!
I have opened a new task [1] to decide whether (or not) to set an upgrade cadence for our Ceph cluster [2].
Your input is more than welcome on the task itself or on this email thread.
There's no deadline, but if there's not a lot of discussion this could be decided right after the holidays.
You can find this one and other ongoing proposals here [3].
Thanks!
[1] https://phabricator.wikimedia.org/T325223
[2] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Ceph
[3] https://phabricator.wikimedia.org/project/board/5263/
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
Hi!
As we have been gathering and defining user stories for the Toolforge Build Service and Toolforge itself, I
have been thinking about the next steps for both of them (and their surroundings), and I wanted to share those
thoughts and have some discussion, to give a bit more direction to our work in those areas.
== TL;DR
Let's think, without constraints, about what we want Toolforge to become.
My opinion:
* Move towards full Platform as a Service
** this means users only interface with our platform
** this might mean offering k8s as a service on top of CloudVPS if needed
* Simple thin client
* Simple thin UI (for people that can't/don't want to use the client)
* API that supports both the above
== Long description
I think that this is already a somewhat popular idea, but to state it plainly: I would like Toolforge to be as easy to
use as DigitalOcean or Heroku, that is, a PaaS.
This means:
* No need for ssh
* Very simple cli (from the user's computer)
* Simple web UI (same capabilities as the cli, for anyone that can't install the cli)
This also means:
* No k8s as a service (discussed later)
* Detaching the users from the underlying implementation
I know that this might require lots of changes, and those are not easy, but let's focus on the features we want, not the
design underneath yet.
What I would like is to have some set of "components" that I can use and combine to create my tool:
Storage:
* Store structured data somewhere (db)
* Store unstructured data somewhere (storage/file-like?/s3?)
Compute:
* Something that runs periodically (cron-like)
* Something that runs once (one-off)
* Something that runs continuously (daemon)
Network:
* Create a public entry point for a web service
* Connect between my components
So, inspired by the DigitalOcean [1] and Heroku [2] CLIs, the toolforge CLI could be as simple as:
* toolforge run
* toolforge run-once
* toolforge run-every
* toolforge db
* toolforge storage
* toolforge expose-port (--public|--local)
Some side-commands could be:
* toolforge tool -> to manage tools themselves (create/add-maintainer/remove-maintainer/...)
* toolforge get-all -> to list all my components
* toolforge logs -> get the logs for a component
* toolforge shell -> start a shell inside a component container (similar to heroku bash), for debugging
* toolforge edit-config -> to allow doing all of the above through some kind of structured spec
This is not an exhaustive list, but it should cover most of the use cases.
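To make that concrete, here is a minimal sketch of a hypothetical session; the subcommands exist only as proposed above, and the flags and arguments are purely illustrative, not a settled interface:

    # hypothetical session; flags/arguments are illustrative only
    toolforge tool create mytool                   # create the tool itself
    toolforge db create                            # provision structured storage
    toolforge run-every "@hourly" -- ./update.sh   # cron-like component
    toolforge run -- ./webservice.py               # continuously-running component
    toolforge expose-port --public 8000            # public entry point
    toolforge get-all                              # list all my components
    toolforge logs webservice                      # fetch a component's logs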
You might be asking now: what about people who need some extra features from k8s?
For those, we can offer k8s as a service (through CloudVPS + terraform for example), so they have full control of their
k8s instances.
Note that I have tried to refrain from adding any implementation details yet, as I think that we should do the
exercise of deciding what we want without limiting ourselves by how we think it could be done.
The limitations will come later :)
== Some random stats for current k8s toolforge usage
Total number of namespaces:
3163
Of which, namespaces that are empty:
1496
That means that only 1667 have something in them. For those, the number of k8s webservices:
1276
Number of grid webservices:
307
Number of tools with cronjobs:
71
Number of tools with >1 cronjob:
47
Number of tools with >10 cronjobs:
6
Number of tools with manually defined resources:
51
I checked a few of those, and they could be covered by "continuous jobs" (that is, daemons), though I have not
reviewed all of them in detail.
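For the curious, numbers like these can be approximated from the cluster itself. A rough sketch, assuming admin kubectl access to the Toolforge cluster; note that the label selector for webservices below is a guess, not the label Toolforge actually applies:

    # Total number of namespaces
    kubectl get namespaces --no-headers | wc -l
    # Number of k8s webservices; the label selector is an assumption
    kubectl get deployments --all-namespaces -l toolforge=tool --no-headers | wc -l
    # Number of distinct namespaces that have at least one cronjob
    kubectl get cronjobs --all-namespaces --no-headers | awk '{print $1}' | sort -u | wc -l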
[1] https://docs.digitalocean.com/reference/doctl/reference/apps
[2] https://devcenter.heroku.com/categories/command-line
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
Hi there!
On 2022-11-28 and 2022-11-29 some misleading emails were sent: you may
have received one (or more) about Puppet failures on your Cloud VPS
virtual machine.
Moreover, those emails were a bit contradictory, containing messages
like "No failed resources" and "No exceptions happened".
There was a problem in the way Puppet errors were calculated, which has
now been fixed [0].
This does not affect Toolforge.
sorry for the noise,
regards.
[0] https://gerrit.wikimedia.org/r/c/operations/puppet/+/861805/
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Hi there,
Today, 2022-11-22 at about 12:25 UTC, as part of a routine operation, I
reimaged/reformatted a cloudvirt hypervisor without relocating all the
virtual machines first.
The data survived the reimage, but the 32 (!) affected virtual machines
were briefly unavailable and then hard-rebooted.
All virtual machines are now ACTIVE (up and running) from the OpenStack
point of view, but please let me know if you need assistance recovering
them in any way.
As of this writing we don't have any automation to ensure we only
reimage empty hypervisors, but we're working on it, to prevent this kind
of human error in the future.
regards. (and sorry!)
(!) Affected virtual machines are:
- ID: 78782628-4f9f-4263-84fc-06e767b3bfe1
Name: mx-wiki
- ID: 1fa9f0d9-42e8-4273-bdb1-a7d49998c13f
Name: synapse01
- ID: 2382fda0-e683-4d0c-95b6-bbbf323904d9
Name: canary1048-04
- ID: 4b570277-e51f-459d-bea2-394c5ad7bc92
Name: tools-sgeexec-10-16
- ID: 66529c1b-f3a3-4ff8-b30d-785f4f274965
Name: feature-store-test
- ID: e153f69a-46a0-458a-ab50-de3d86aa861b
Name: toolsbeta-test-k8s-worker-7
- ID: c3a2d1a9-f811-4da9-afba-3a113c8ff729
Name: wbregistry-02
- ID: 2b56c575-08a5-4def-87cb-bee5bd43e4f9
Name: prod
- ID: 141ac13c-f0fa-46d3-9d2a-cede8bc854c6
Name: devtools-puppetdb1001
- ID: fdb15c24-0b41-42d6-9c4a-82afd1d9dcb9
Name: tools-sgeweblight-10-31
- ID: 56e55a31-8d32-455e-b650-b7194e71d2fd
Name: runner-1023
- ID: cb4a87e4-264e-4c8f-8197-3efff54346de
Name: runner-1022
- ID: 5b6b5733-565d-456e-a4fc-85ce669d3fd2
Name: deployment-mdb02
- ID: 75dce76d-36ad-4f9e-85e9-8a11ff6744db
Name: wikibase-product-testing-2022
- ID: 868d3dca-3e5c-4089-89a9-2c7e756c3e31
Name: toolsbeta-cumin-1
- ID: 42ac6d8a-453a-4620-b4b7-9c97994c98fb
Name: integration-agent-docker-1030
- ID: 084da652-503d-49a7-9ffa-98a0cd5335fd
Name: toolsbeta-sgeexec-10-5
- ID: f098fe82-18b6-49a9-962d-9b8f1f989b14
Name: pcc-worker1001
- ID: 8eb272dc-8006-4e93-a966-5035809324d9
Name: deployment-mx03
- ID: e67d0e4c-e07c-4d9a-8ddb-cb0bc8efa388
Name: deployment-docker-api-gateway01
- ID: b958511a-10cb-4e62-bdbb-6da5013dd62f
Name: soweego
- ID: 62045cf9-59ed-44b9-a268-1c9f171b5aae
Name: tools-package-builder-04
- ID: 0127e905-f52e-4ed4-b60d-260102a8e625
Name: pontoon-lb-02
- ID: 827bf744-262a-458b-951d-f2e9a377e075
Name: toolsbeta-test-k8s-ingress-3
- ID: 3e6c31d7-b4db-4a5f-a610-a74d0013f631
Name: pki-test01
- ID: 8893ba32-fb5c-4567-a242-b6c676978b7d
Name: deployment-urldownloader03
- ID: f72e5b18-6376-4ccd-9e59-64447759e53f
Name: deployment-deploy03
- ID: 006dea0a-a1eb-4de3-bf45-1a071ad87152
Name: kafka-test-cloud-2
- ID: e05220d7-8ca1-4d9f-a933-01a843286ea8
Name: toolsbeta-docker-imagebuilder-01
- ID: 416f445a-cad4-45c2-b32e-f17100f93eac
Name: cloud-puppetmaster-05
- ID: 4e492051-25a3-4442-b8b9-1959f42917fe
Name: tools-k8s-worker-76
- ID: df18863a-2da7-4951-aa69-936b3d889592
Name: deployment-docker-cpjobqueue01
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
I think we could start monitoring prometheus-node-exporter on all Cloud
VPS VMs on all projects via the Prometheus instance in metricsinfra. The
required firewall rules are now in place (thanks to Andrew in T288108),
and I've written the required patches to
cloud/metricsinfra/prometheus-manager and to the Puppet repo:
https://gerrit.wikimedia.org/r/c/cloud/metricsinfra/prometheus-manager/+/85…
https://gerrit.wikimedia.org/r/c/operations/puppet/+/856917/
The main effect this will have is that we (and project admins, of
course) will have basic metrics (think CPU, disk, RAM, so on) for all
instances in all projects. Currently these wouldn't send any alerts
unless manually configured by a metricsinfra admin.
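For context, on the Prometheus side this boils down to a per-project scrape job against node-exporter's default port (9100). A minimal sketch of what such configuration could look like; the job name and target instance are illustrative, not the actual prometheus-manager output:

    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets:
              # node-exporter listens on port 9100 by default;
              # the instance name here is made up for illustration
              - 'example-instance.example-project.eqiad1.wikimedia.cloud:9100'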
Please let me know if you have any questions or concerns, otherwise I'd
like to move forward in the next few days.
Taavi
Hi there,
Toolforge is a complex service. There are many moving parts and there
are always several people working on different pieces of it.
We have been holding informal Toolforge-specific meetings from time to
time, to unblock some decisions or to get everyone on the same page.
The proposal is to create a monthly 1h Toolforge engineering-focused
meeting called "Toolforge council".
This meeting would be open in nature, and would include:
* The WMCS/TE team
* Toolforge community root group members [0]
* Other interested parties, who can be invited if required
The notes and results of the meeting will be published somewhere on
Wikitech, and perhaps on this very mailing list.
The next two meetings of this kind will be:
* 2022-11-08 at 15:00 UTC
* 2022-12-13 at 15:00 UTC
For these next two, I will facilitate/moderate them, as well as
collect/share some agenda points beforehand.
I would like to avoid formalizing any other protocols regarding the
meeting beyond what is contained in this email. This is already an
evolution of the informal approach we have been using. Let's see how it
evolves organically.
Comments welcome (including naming hehe).
[0]
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#What_makes_a_roo…
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Hi all!
I'm trying to gather which Python versions need to be supported for running cookbooks. I would appreciate it if you
could reply to this email telling me which version you would be running cookbooks with (replying directly to me is fine, to avoid spamming others ;) ).
Thanks!
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
Continuing from my post on cloud@...
On Thu, Oct 6, 2022 at 6:21 PM Bryan Davis <bd808(a)wikimedia.org> wrote:
>
> On Thu, Oct 6, 2022 at 5:39 AM Taavi Väänänen <hi(a)taavi.wtf> wrote:
> >
> > In general, I feel that over the last few months,
> > quite a lot of planning and progress reporting has moved from our
> > various public channels (most notably Phabricator and -cloud-admin on
> > IRC) to private ones. I don't particularly like this trend.
>
> I did a thing in my late afternoon yesterday that may have aggravated
> Taavi's feelings of being left out of decision loops.
>
> I made a decision without consulting any other Toolforge admins to add
> about 300MiB of fonts to the php7.4 Docker image available for use on
> Toolforge [0]. This decision reversed my prior blocking of this exact
> same request in 2019 [1]. It also goes against at least as many years
> of the Toolforge admins telling the Toolforge member community that we
> do not "bloat" the Kubernetes containers with specialty features for a
> small number of use cases. This reversal will complicate future
> decisions on such issues by introducing this easily seen counter
> example. I acted with good intent in the moment, but I did not act
> with good judgement nor consideration of my partners in maintaining
> the Toolforge infrastructure. For that I am truly sorry.
>
> I would also like to apologize for treating what I was doing as
> "urgent" when it could have easily waited for a discussion with others
> either in code review or in other forums. This false urgency was
> counter to what I know to be the best way to treat technical decisions
> and it was disrespectful of my co-admins in the Toolforge environment.
>
> I would also like to have a conversation among the Toolforge admins
> about how to best deal with this decision going forward. That
> conversation is probably better had on Phabricator or the cloud-admin
> mailing list than here, but it should happen and it should result in
> either reverting the change that I made or jointly creating updated
> guidelines for what is and is not acceptable in the shared Kubernetes
> containers while we await better methods of managing per-tool feature
> differences.
>
> [0]: https://phabricator.wikimedia.org/T310435#8288848
> [1]: https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/+…
For the fonts themselves, should we:
* Revert the change and tell svgtranslate to move back to the grid?
* Propagate the change outward by making the same/similar change to
all php images?
* Propagate the change outward by making the same/similar change to
all base images?
* Let it be.
For the bigger picture of breaking our long-held stance on "bloat", I
would like to hear suggestions from y'all. If the decision is to revert
the fonts, then maybe there is nothing to talk about here. If the fonts
stay, then I think we need to either document this as a rogue action
that has been allowed to stand, which should not set a precedent for
the future, or come up with a rubric for what is allowed and why.
I am also open to hearing from anyone on or off list who feels that I
need to make additional amends to the Toolforge admins, the Toolforge
user community, or any particular individuals. I really didn't mean to
make a mess, but I did and I would like to work towards correcting
that as much as possible.
Bryan
PS I will be out of office until 2022-10-11, but I will try to check
in on this thread in the intervening days.
--
Bryan Davis Technical Engagement Wikimedia Foundation
Principal Software Engineer Boise, ID USA
[[m:User:BDavis_(WMF)]] irc: bd808
Hi there,
just today I introduced the $::wmcs_project var to replace $::labsproject
https://gerrit.wikimedia.org/r/c/operations/puppet/+/849050
Please use the modern one.
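In practice this is a one-line change in any manifest that still references the old fact; a sketch (the $project variable name is hypothetical):

    # before (deprecated):
    $project = $::labsproject
    # after (modern replacement):
    $project = $::wmcs_project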
Perhaps I should try to nerd-snipe someone to see if we can have the
linter reject new patches with the old variable.
regards.
--
Arturo Borrero Gonzalez
Senior Site Reliability Engineer
Wikimedia Cloud Services
Wikimedia Foundation
Hi cloud-admin@,
The recent cloud@ thread made me realize that I should probably keep
everyone else more up to date on the infrastructure-level projects I'm
working on by myself. So I've tried to summarize the major recent and
upcoming changes below, in semi-random order.
Please let me know if you find this useful or interesting (or if you
don't, it helps to know that too). Questions and comments are also welcome.
Terraform
I sent Puppet patches[0] to enable application credential authentication
in Keystone to let arbitrary clients speak to the OpenStack APIs. I
believe Andrew is working on the firewall rules and related HAProxy
config to open up the APIs to the public as a part of the
Cumin/Spicerack work going on at the moment.
I tagged the initial version of the custom terraform-cloudvps
Terraform provider.[1] The provider is designed to supplement the
'official' OpenStack provider and currently lets you interact with the
web proxy API using the new go-cloudvps library[2], with Puppet ENC
support next up on my Terraform TODO list.
There's also a Puppet patch[3] pending to configure a self-hosted
Terraform registry on terraform.wmcloud.org. It's cherry-picked to the
project puppet master, but having it actually merged would be nice.
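To illustrate how this could look from a user's point of view, here is a hypothetical Terraform snippet; the provider source address and the resource name/attributes are assumptions, not the provider's confirmed schema:

    terraform {
      required_providers {
        cloudvps = {
          # assumed source address on the self-hosted registry; not final
          source = "terraform.wmcloud.org/cloud-vps/cloudvps"
        }
      }
    }

    # hypothetical resource managing a Cloud VPS web proxy entry
    resource "cloudvps_web_proxy" "example" {
      hostname = "mytool"
      backend  = "http://172.16.0.42:8000"
    }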
[0]: https://gerrit.wikimedia.org/r/c/operations/puppet/+/840121
[1]: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/terraform-cloudvps
[2]: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/go-cloudvps
[3]: https://gerrit.wikimedia.org/r/c/operations/puppet/+/834344
CloudVPS web proxy
Planning on doing some work to make the proxy service more reliable in
case of node failure. Also planned is moving the current SQLite database
to the cloudinfra MariaDB cluster for reliability / easier failover
purposes. There are a few Puppet patches prepping for this pending
review, starting from [4].
[4]: https://gerrit.wikimedia.org/r/c/operations/puppet/+/831041
Toolforge
Sent a few patches to the jobs-framework-* repositories. Planning to do
a bit more cleanup here, to hopefully make the grid migration easier.
I'd like to introduce a new k8s utility, kube-container-updater[5], to
automatically restart long-running containers that are running outdated
images.
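As a point of reference, the manual equivalent of what such a utility might automate is roughly the following (namespace and deployment name are illustrative, and the re-pull assumes imagePullPolicy: Always):

    # restart a workload so it picks up a newer image
    kubectl --namespace tool-example rollout restart deployment/example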
Upgrading to Kubernetes 1.22 is only blocked on dealing with certificate
generation for the custom webhooks[6]. For this, I'd like to get
feedback on the approach (continue to manually sign certificates, or
introduce cert-manager to automate that). Looking further ahead at the
k8s versions: 1.23 will be fairly simple, I think, and 1.24 will require
migrating the cluster from Docker to containerd, which I'd like to pair
with a bullseye upgrade.
Once we have an object storage service I'd like to look a bit more into
providing a logging solution that doesn't use NFS.
[5]:
https://gerrit.wikimedia.org/r/c/cloud/toolforge/kube-container-updater/+/8…
[6]: https://phabricator.wikimedia.org/T286856
metricsinfra
No recent development here. I think we could roll out Prometheus
scraping to all projects and instances with the current infra, but for
that someone would need to sort out how to deal with security groups
with the pull model Prometheus uses. Some discussion about this is in
Phabricator[7].
The next item on the metricsinfra roadmap is building an API to
let projects manage their scraping rules and alerts. I'd like to
integrate that with Terraform at some point.
[7]: https://phabricator.wikimedia.org/T288108
Puppet ENC service
Planning to do some work[8] on the ENC API service, mostly to make it
work with Terraform. Most notably the Git integration will be moved from
the Horizon dashboard to the API service itself.
[8]: https://phabricator.wikimedia.org/T317478
ToolsDB
No recent developments here either. :(