Hi all,
I'm planning a couple of major changes to our Cloud VPS observability (o11y) stack in the near term. Most of this should also be visible on Phabricator, but I wanted to flag it here as well, since following activity on Phabricator is hard and I don't want to cause any major surprises.
I'm sure some of this will affect our users, and I/we will need to communicate it to them beforehand, but I'm not at that stage quite yet.
== 1. New Grafana instance: grafana.wmcloud.org ==
The first and hopefully least impactful change is replacing the current Grafana instance, grafana-cloud.wikimedia.org (aka grafana-labs.wikimedia.org), with a new one. The reason is that the current instance runs directly on hardware (cloudmetrics*.eqiad.wmnet), and due to upstream Grafana changes it soon won't be able to reach Prometheus instances living on Cloud VPS VMs.
This work is tracked as T307465, and has patches up for review starting from https://gerrit.wikimedia.org/r/c/operations/puppet/+/869210/.
== 2. Diamond removal ==
The Prometheus instance in metricsinfra now scrapes all Cloud VPS VMs. This was the primary blocker for getting rid of Diamond (a Python 2 program that collects node metrics and pushes them to Graphite). I hope this transition will be mostly invisible to users if we migrate the most-used Grafana dashboard (cloud-vps-project-board) to pull its metrics from Prometheus instead.
This is tracked as T317032.
== 3. Statsd/Graphite removal (once Diamond is gone) ==
My understanding is that the statsd/Graphite service was not originally intended as a generic service for Cloud VPS users (although it certainly is used like one today). Either way, we don't really have a good replacement for it, except for some limited cases that could use node-exporter text files instead. I'm not sure how big of a deal that is if we never claimed to support it anyway?
This is tracked as T326266.
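For anyone wondering what the node-exporter text file option could look like in practice, here is a minimal sketch, not a tested or agreed-upon setup. It assumes node-exporter runs with its textfile collector pointed at some directory (via --collector.textfile.directory) and that a job on the VM can write there. The metric name, label, and helper function names are made up for illustration, and I'm not claiming any particular directory path for Cloud VPS VMs.

```python
import os
import tempfile


def render_metric(name, value, help_text, metric_type="gauge", labels=None):
    """Render one metric in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        def escape(v):
            # Label values must escape backslashes, double quotes and newlines.
            return str(v).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

        pairs = ",".join(f'{k}="{escape(v)}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name}{label_str} {value}\n"
    )


def write_textfile(directory, filename, content):
    """Write the file atomically so a scrape never sees a half-written file."""
    if not filename.endswith(".prom"):
        # The textfile collector only picks up files ending in .prom.
        raise ValueError("filename must end in .prom")
    target = os.path.join(directory, filename)
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.chmod(tmp_path, 0o644)  # mkstemp creates files as 0600
        os.replace(tmp_path, target)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return target
```

A cron job or systemd timer could call write_textfile() at the end of each run, for example to record the timestamp of the last successful backup. This only covers values that can be written periodically from the VM itself; it is not a drop-in replacement for pushing arbitrary statsd counters.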
Any questions or comments on the above?
Taavi