Hello Admins,
The build service is ready for the next phase, and we want to get some
early users to test it and give feedback!
This next phase will start this Monday, when around 100 tool maintainers
will receive an email kindly asking them to try out the service.
The full list will be made available in the task [1].
You can read a draft of the email here [2].
Later next week there will also be a session at the Athens Hackathon
(thanks Slavina!) [3] introducing the build service to some new users,
since fresh user experiences give an important perspective as well.
We will tentatively stay in this feedback period for ~1 month. At that
point we will sit back, reflect on the feedback (we will share our
thoughts with you too), and decide if we are ready for a broader
announcement (cloud-announcement, blog post, …, to be defined) to start
getting wider input.
In preparation for this, the team (including volunteers) has been working
on setting up some minimal information on wikitech [4] and phabricator [5].
Feel free to add comments to the talk page [6] and/or make changes and
fixes to those pages.
There are still many things to figure out, and many more will show up
during this phase; your help and input will be critical in shaping this
service.
I encourage you to try it out yourself if you have not already, and to
open any bugs you find or feature requests you think would be useful; you
can find the links for those on the feedback page [7].
You can see a more detailed plan in the task [1] (feel free to add
comments there too).
Thanks for your continued support!
[1]: https://phabricator.wikimedia.org/T335249
[2]: https://etherpad.wikimedia.org/p/tQr7TQr20xorkEXXk6Pj
[3]: https://phabricator.wikimedia.org/T336055
[4]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service
[5]: https://phabricator.wikimedia.org/project/profile/6529/
[6]: https://wikitech.wikimedia.org/wiki/Help_talk:Toolforge/Build_Service
[7]: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Feedback
--
Seyram Komla Sapaty
Developer Advocate
Wikimedia Cloud Services
Hi there,
in the last quarter, we conducted some research [0] to evaluate and rethink how
we deploy and offer Cloud VPS (OpenStack), i.e., our IaaS setup.
The research results were written up in a wiki page [1]; the summary is that the
most attractive option for us is to move to a Kubernetes + openstack-helm
deployment of Cloud VPS / OpenStack in the future.
This project is not trivial, and may actually be a multi-year effort, so when /
why / how it will start is yet to be decided.
The wiki page [1] contains plenty of detail about everything related to the
project, as well as a number of open questions and unknowns. Feel free to point
out gaps or missing information in the plans.
regards.
[0] https://phabricator.wikimedia.org/T326758
[1]
https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhanceme…
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation
In response to our recent maintenance windows, we got some feedback [0]
about advance notice of outages. I created this chart to provide us with
some internal guidelines about when we should publicize maintenance, and
how to do so:
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maintenance_noti…
You will notice that at the moment my imagination is limited to 'write
to a mailing list.' I encourage people to fill in ideas on that page (or
the associated talk page) about other ways we can warn people about
these things. If we wind up with so many broadcast channels that it
becomes impractical to actually use them all, we can invest in automation.
I'm also not especially committed to the brackets on that chart; I'd
like to have broad categories and low standards, but edits are welcome!
One thing that I want to be more mindful about is the distinction
between "things that mess with our users" (e.g. quarry or horizon
downtime) vs. "things that mess with our users' users" (e.g. web proxy
downtime). I'd love it if someone with better wiki-editing skills
spruced up the chart to reflect that difference.
-A
[0] for example https://phabricator.wikimedia.org/T333477#8764263
On 3/30/23 12:42, Arturo Borrero Gonzalez wrote:
> On 3/28/23 00:13, Taavi Väänänen wrote:
>> Hi,
>>
>> We will be upgrading the Toolforge Kubernetes cluster next Monday (2023-04-03)
>> starting at around 10:00 UTC.
>>
>> The expected impact is that tools running on the Kubernetes cluster will get
>> restarted a couple of times over the course of the few hours it takes for us
>> to upgrade the entire cluster. The ability to manage tools will remain
>> operational.
>>
>> Since the version we're upgrading to (1.22) removes a bunch of deprecated
>> Kubernetes APIs, tools that use kubectl and raw Kubernetes resources directly
>> may want to check that they're on the latest available versions. The vast
>> majority of tools that are only using the Jobs framework and/or the webservice
>> command are not affected by these changes.
>>
>
> This has been rescheduled to Monday 2023-04-10 to leave room for the other
> operations we have.
>
Hi there!
This is happening now!
https://phabricator.wikimedia.org/T286856
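As a reminder of the API removals mentioned above: if your tool applies raw
Kubernetes manifests with kubectl, one quick way to spot resources still
written against versions removed in 1.22 is to grep your saved manifests.
This is only a sketch, and the directory name is just an example:

  # Show which API versions your saved manifests request (path is an example)
  grep -R "^apiVersion:" "$HOME/k8s-manifests/"
  # Anything still on a beta group removed in Kubernetes 1.22, e.g. an Ingress
  # using networking.k8s.io/v1beta1, needs to move to its stable replacement
  # (networking.k8s.io/v1) before the upgrade.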
regards.
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation
Hi there,
if you are using lima-kilo [0], I just merged a change [1] that confines all of
its content to a single directory (~/.local/toolforge-lima-kilo/), including the
user configuration file:
* old: ~/.config/toolforge-lima-kilo-userconfig.yaml
* new: ~/.local/toolforge-lima-kilo/userconfig.yaml
The update includes some logic to handle this change (it will relocate the old
file to the new location if you haven't done so already), but note that adding
new config settings to the old file won't have any effect.
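If you prefer to move the file yourself before updating, this is all it takes
(a sketch using the paths listed above):

  # Relocate the old lima-kilo user config into the new directory
  mkdir -p ~/.local/toolforge-lima-kilo
  mv ~/.config/toolforge-lima-kilo-userconfig.yaml \
     ~/.local/toolforge-lima-kilo/userconfig.yaml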
regards.
[0] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/lima-…
[1]
https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/commit/5d981…
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation
Hi there!
Today 2023-03-06, in a few minutes, we will restart the Toolforge internal
network. A brief interruption of network communications is expected during the
maintenance.
This is because we're re-deploying Calico to the Kubernetes cluster [0].
No action is required on your side.
regards.
[0] https://phabricator.wikimedia.org/T328539
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation
Hi all,
There are a couple of major changes to our Cloud VPS o11y stack that I'm
planning to make in the near term. Most of this should be visible on
Phabricator as well, but I wanted to make everyone aware here regardless,
since following activity on Phabricator is hard and I don't want to cause
any major surprises.
I'm sure some of this will have an effect on our users and I/we need to
communicate it beforehand, but I'm not at that stage quite yet.
== 1. New Grafana instance: grafana.wmcloud.org ==
The first and hopefully least impactful change is replacing the current
grafana-cloud.wikimedia.org (aka grafana-labs.wikimedia.org) Grafana
instance with a new one. The reason is that the current one runs
directly on hardware (cloudmetrics*.eqiad.wmnet), and due to upstream
Grafana changes it soon won't be able to reach out to Prometheus
instances living on cloud-vps VMs.
This work is tracked as T307465, and has patches up for review starting
from https://gerrit.wikimedia.org/r/c/operations/puppet/+/869210/.
== 2. Diamond removal ==
The Prometheus instance in metricsinfra now scrapes all Cloud VPS VMs.
This was the primary blocker for getting rid of Diamond (a Python 2
program that collected node metrics and pushed them to Graphite). I hope
that this transition will be mostly invisible to users if we migrate the
most used Grafana dashboard (cloud-vps-project-board) to pull the
metrics from Prometheus instead.
This is tracked as T317032.
== 3. Statsd/Graphite removal (once Diamond is gone) ==
My understanding is that the statsd/Graphite service was originally not
intended as a generic service for cloud-vps users (although it certainly
is used like one today). Either way, we don't really have a good
replacement for it, except for some limited cases that could use
node-exporter text files instead. I'm not sure how big of a deal that is
if we never claimed to support it anyway?
This is tracked as T326266.
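For those limited cases, the node-exporter textfile route looks roughly like
this (a sketch only: the metric name is made up and the collector directory
depends on how node-exporter is configured on the VM):

  # Write a metric in Prometheus exposition format into the textfile collector
  # directory; node-exporter includes *.prom files from there on each scrape.
  echo "mytool_last_successful_run_timestamp_seconds $(date +%s)" \
    > /var/lib/prometheus/node.d/mytool.prom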
Any questions or comments on the above?
Taavi
Hi there,
The Toolforge jobs framework just got upgraded with a few new features:
* support for custom logs
* support for job failure retry policy
* new behavior with job image listing
* some initial validation of YAML files
The documentation should be mostly up-to-date in wikitech:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework
You can stop reading here unless you want more details :-)
The custom log files feature will allow you to do things like:
* using a custom directory to store log files
* merging stdout/stderr logs together into a single file
* ignoring one of the two log streams
The job retry policy allows you to instruct the compute engine to restart
failed jobs, up to 5 times.
Job images are now listed in a different format, and deprecated images are
hidden by default, to encourage usage of newer ones.
Regarding the YAML validation, the toolforge-jobs utility will now emit a
warning if a key is unknown. We plan to make this more robust in the future,
including by providing a schema file.
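As a rough illustration of the first two features, a one-off job with retries
and a single merged log file could be launched like this (the job name, image
and paths below are placeholders, and the exact flags are best checked against
the wikitech page above):

  # Run a job that retries on failure up to 5 times and sends both stdout and
  # stderr to the same custom log file ('daily-update' and the paths are
  # examples; `toolforge-jobs images` lists the images actually available).
  toolforge-jobs run daily-update \
      --command "./update.sh" \
      --image python3.11 \
      --retry 5 \
      --filelog-stdout logs/daily-update.log \
      --filelog-stderr logs/daily-update.log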
We don't usually announce upgrades, but this one in particular contained
much-awaited features. This is the result of hard work by several folks, in
particular Taavi (community member) and Raymond (WMF contractor).
Happy `toolforging`. Regards.
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation
Hi,
following the discussion in https://phabricator.wikimedia.org/T322756,
yesterday I made some changes to the cloud-services-team Phabricator
boards.
The main change is that most tasks have been moved from the
"cloud-services-team (Kanban)" milestone board [1] to the
"cloud-services-team" project board [2]. Columns have retained similar
names, but there is a new column "FY2022/2023-Q3" that includes tasks
that have been prioritized or are being actively worked on in the
current quarter.
Clicking on the title of that column will take you to a "zoomed in"
view of those tasks [3] where they are divided into 4 columns:
Backlog, In progress, Blocked and Done.
I went through the tasks that were in the "Doing" column of the old
kanban board, moved the ones that had recent activity to "In progress"
in the Q3 board, and moved back to "Inbox" the tasks that didn't seem
to have any recent activity. Feel free to move tasks to a more
appropriate column if you're planning to work on them soon.
While these boards are primarily used by members of the WMCS team, I
imagine they might be checked by people outside the team as well, so
I'm sending a quick heads-up to this wider list. This isn't likely to
be the final state of the boards, but I hope that iterating on their
shape will lead us to a place where the boards are more useful for
people inside and outside of the team.
If you have any comments or concerns, please leave a comment in the
follow-up task at https://phabricator.wikimedia.org/T327309
[1] https://phabricator.wikimedia.org/project/board/2774/
[2] https://phabricator.wikimedia.org/project/board/2773/
[3] https://phabricator.wikimedia.org/project/board/6358/
Thanks,
Francesco
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation
Hi there,
the Toolforge jobs service [0] (the one you would use via the `toolforge-jobs`
command line interface) will have a brief maintenance today 2023-01-10 @ 11:30
UTC (in about 15 minutes).
We need to restart the API service and it will be down for a couple of minutes
(perhaps even less).
During that time, using the toolforge-jobs command line interface will most
likely fail.
regards.
[0] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation