Hi all,
There are a couple of major changes to our Cloud VPS observability (o11y)
stack that I'm planning to make in the near term. Most of this should be
visible on Phabricator as well, but I wanted to make everyone aware here
regardless, since following activity on Phabricator is hard and I don't
want to cause any major surprises.
I'm sure some of this will have an effect on our users and I/we need to
communicate it beforehand, but I'm not at that stage quite yet.
== 1. New Grafana instance: grafana.wmcloud.org ==
The first and hopefully least impactful change is replacing the current
grafana-cloud.wikimedia.org (aka grafana-labs.wikimedia.org) Grafana
instance with a new one. The reason is that the current one runs
directly on hardware (cloudmetrics*.eqiad.wmnet), and due to upstream
Grafana changes it soon won't be able to reach out to Prometheus
instances living on cloud-vps VMs.
This work is tracked as T307465, and has patches up for review starting
from https://gerrit.wikimedia.org/r/c/operations/puppet/+/869210/.
== 2. Diamond removal ==
The Prometheus instance in metricsinfra now scrapes all Cloud VPS VMs.
This was the primary blocker for getting rid of Diamond (a Python 2
program that collected node metrics and pushed them to Graphite). I hope
that this transition will be mostly invisible to users if we migrate the
most used Grafana dashboard (cloud-vps-project-board) to pull the
metrics from Prometheus instead.
This is tracked as T317032.
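For anyone who reads these metrics from scripts rather than dashboards, the switch roughly means replacing Graphite lookups with PromQL queries against the Prometheus HTTP API. Below is a minimal sketch that only builds such a query URL. The base URL and the `project` label are assumptions made for illustration, not the actual metricsinfra configuration; `node_load1` is a standard node-exporter metric and `/api/v1/query` is the standard Prometheus instant-query endpoint.

```python
from urllib.parse import urlencode

# Hypothetical base URL: replace with the real Prometheus endpoint.
PROMETHEUS_URL = "https://prometheus.example.org"


def build_query_url(project: str, metric: str = "node_load1") -> str:
    """Build an instant-query URL for one metric.

    The `project` label used for filtering is an assumption; check which
    labels the real Prometheus instance attaches to its targets.
    """
    promql = f'{metric}{{project="{project}"}}'
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    print(build_query_url("tools"))
```

Fetching that URL with any HTTP client returns a JSON document whose `data.result` list holds one entry per matching time series.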
== 3. Statsd/Graphite removal (once Diamond is gone) ==
My understanding is that the statsd/Graphite service was originally not
intended as a generic service for cloud-vps users (although it certainly
is used like one today). Either way, we don't really have a good
replacement for it, except for some limited cases that could use the
node-exporter textfile collector instead. I'm not sure how big of a deal
that is if we never claimed to support it anyway?
This is tracked as T326266.
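To illustrate the node-exporter option for the limited cases mentioned above: the textfile collector reads `*.prom` files from the directory passed via `--collector.textfile.directory` and exposes their contents alongside the regular node metrics. The sketch below writes one gauge in the Prometheus text exposition format. The helper name, metric name and directory are hypothetical, and whether a given VM has a suitable directory configured is an assumption.

```python
import os
import tempfile


def write_textfile_metric(directory, name, value, help_text):
    """Atomically write a single gauge as `<name>.prom` in `directory`."""
    body = (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )
    final_path = os.path.join(directory, f"{name}.prom")
    # Write to a temporary file and rename it, so that node-exporter
    # never scrapes a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    # mkstemp creates the file as 0600; make it readable by the exporter.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == "__main__":
    # Hypothetical usage: a temporary directory stands in for the real one.
    target = tempfile.mkdtemp()
    print(write_textfile_metric(target, "my_tool_queue_length", 3, "Items waiting"))
```

Unlike statsd, this is a pull model: the value is only picked up the next time Prometheus scrapes the VM, so it suits gauges and slowly changing counters more than high-frequency events.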
Any questions or comments on the above?
Taavi
Hi there,
The Toolforge jobs framework just got upgraded with a few new features:
* support for custom logs
* support for job failure retry policy
* new behavior with job image listing
* some initial validation of YAML files
The documentation on Wikitech should be mostly up to date:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework
You can stop reading here unless you want more details :-)
The custom log files feature will allow you to do things like:
* using a custom directory to store log files
* merging stdout/stderr logs together into a single file
* ignoring one of the two log streams
The job retry policy lets you instruct the computing engine to restart jobs
that failed, up to 5 times.
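As a rough illustration of what "retry up to 5 times" means in practice, here is a small sketch of the semantics. It is not how the computing engine implements restarts; the function name and the callable-based interface are invented for this example.

```python
MAX_RETRIES = 5  # upper bound stated in this announcement


def run_with_retries(job, retries):
    """Run `job` and rerun it after each failure, at most `retries` times.

    `job` is any callable returning an exit code (0 means success).
    Returns a tuple of (number of attempts, final exit code).
    """
    if not 0 <= retries <= MAX_RETRIES:
        raise ValueError(f"retries must be between 0 and {MAX_RETRIES}")
    attempts = 0
    while True:
        attempts += 1
        exit_code = job()
        # Stop on success, or once the initial run plus all retries are used.
        if exit_code == 0 or attempts > retries:
            return attempts, exit_code


if __name__ == "__main__":
    # A job that fails twice and then succeeds.
    outcomes = iter([1, 1, 0])
    print(run_with_retries(lambda: next(outcomes), retries=5))
```

Note that a retry count of N allows up to N + 1 runs in total: the original attempt plus N restarts.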
Job images are now listed in a different format, and deprecated images are
hidden by default, to encourage usage of newer ones.
Regarding the YAML validation, the toolforge-jobs utility will now emit a
warning when it encounters an unknown key. We plan to make this more robust
in the future and to provide a schema file.
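The unknown-key check can be pictured with the sketch below. It is an illustration only, not the toolforge-jobs code: the set of known keys is a hypothetical placeholder (see the documentation for the real ones), and the job definition is assumed to be already parsed from YAML into a dictionary.

```python
import warnings

# Hypothetical set of recognised keys, for illustration only.
KNOWN_KEYS = {"name", "command", "image", "schedule", "continuous"}


def warn_unknown_keys(job, known=KNOWN_KEYS):
    """Warn about every key in `job` that is not in `known`.

    Returns the sorted list of unknown keys so callers can act on it.
    """
    unknown = sorted(set(job) - set(known))
    for key in unknown:
        warnings.warn(
            f"unknown key '{key}' in job definition "
            f"'{job.get('name', '<unnamed>')}'"
        )
    return unknown


if __name__ == "__main__":
    # "imagee" is a deliberate typo that the check should flag.
    job = {"name": "daily-report", "command": "./report.sh", "imagee": "python3.9"}
    print(warn_unknown_keys(job))
```

A warning instead of a hard error keeps existing files working while still pointing out typos such as a misspelled key.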
We don't usually announce upgrades, but this one in particular contains
long-awaited features. This is the result of hard work by several folks, in
particular Taavi (community member) and Raymond (WMF contractor).
Happy `toolforging`. Regards.
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation
Hi,
Following the discussion in https://phabricator.wikimedia.org/T322756,
yesterday I made some changes to the cloud-services-team Phabricator
boards.
The main change is that most tasks have been moved from the
"cloud-services-team (Kanban)" milestone board [1] to the
"cloud-services-team" project board [2]. Columns have retained similar
names, but there is a new column "FY2022/2023-Q3" that includes tasks
that have been prioritized or are being actively worked on in the
current quarter.
Clicking on the title of that column will take you to a "zoomed in"
view of those tasks [3] where they are divided into 4 columns:
Backlog, In progress, Blocked and Done.
I went through the tasks that were in the "Doing" column of the old
kanban board, moved the ones that had recent activity to "In progress"
in the Q3 board, and moved the ones that didn't seem to have any recent
activity back to "Inbox". Feel free to move tasks to a more
appropriate column if you're planning to work on them soon.
While these boards are primarily used by members of the WMCS team, I
imagine they might be checked by people outside the team as well, so
I'm sending a quick heads-up to this wider list. This isn't likely to
be the final state of the boards, but I hope that iterating on their
shape will lead us to a place where the boards are more useful for
people inside and outside of the team.
If you have any comments or concerns, please leave a comment in the
follow-up task at https://phabricator.wikimedia.org/T327309
[1] https://phabricator.wikimedia.org/project/board/2774/
[2] https://phabricator.wikimedia.org/project/board/2773/
[3] https://phabricator.wikimedia.org/project/board/6358/
Thanks,
Francesco
--
Francesco Negri (he/him) -- IRC: dhinus
Site Reliability Engineer, Cloud Services team
Wikimedia Foundation
Hi there,
the Toolforge jobs service [0] (the one you would use via the `toolforge-jobs`
command line interface) will have a brief maintenance today 2023-01-10 @ 11:30
UTC (in about 15 minutes).
We need to restart the API service and it will be down for a couple of minutes
(perhaps even less).
During that time, using the toolforge-jobs command line interface will most
likely fail.
regards.
[0] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework
--
Arturo Borrero Gonzalez
Senior SRE / Wikimedia Cloud Services
Wikimedia Foundation