Hi all,
There are three major changes to our Cloud VPS observability (o11y)
stack that I'm planning to make in the near term. Most of this should be
visible on Phabricator as well, but I wanted to flag it here too, since
following activity on Phabricator is hard and I don't want to cause any
major surprises.
I'm sure some of this will have an effect on our users, and we will need
to communicate it beforehand, but I'm not at that stage quite yet.
== 1. New Grafana instance: grafana.wmcloud.org ==
The first and hopefully least impactful change is replacing the current
grafana-cloud.wikimedia.org (aka grafana-labs.wikimedia.org) Grafana
instance with a new one. The reason is that the current one runs
directly on hardware (cloudmetrics*.eqiad.wmnet), and due to upstream
Grafana changes it will soon be unable to reach Prometheus instances
running on Cloud VPS VMs.
This work is tracked as T307465, and has patches up for review starting
from https://gerrit.wikimedia.org/r/c/operations/puppet/+/869210/.
== 2. Diamond removal ==
The Prometheus instance in metricsinfra now scrapes all Cloud VPS VMs.
This was the primary blocker for getting rid of Diamond (a Python 2
program that collected node metrics and pushed them to Graphite). I hope
that this transition will be mostly invisible to users if we migrate the
most-used Grafana dashboard (cloud-vps-project-board) to pull its
metrics from Prometheus instead.
This is tracked as T317032.
== 3. Statsd/Graphite removal (once Diamond is gone) ==
My understanding is that the statsd/Graphite service was not originally
intended as a generic service for Cloud VPS users (although it certainly
is used like one today). Either way, we don't really have a good
replacement for it, except for some limited cases that could use
node-exporter text files instead. I'm not sure how big of a deal that
is, given that we never claimed to support it anyway.
This is tracked as T326266.
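
To give a rough idea of what the node-exporter text file route could
look like for those limited cases, here is a minimal sketch. It assumes
node-exporter on the VM has its textfile collector pointed at some
directory (via --collector.textfile.directory); the function name, the
metric name and the directory are all made up for illustration, and
none of this reflects an actual setup we have today.

```python
import os
import tempfile


def write_textfile_metric(directory, name, value, help_text, metric_type="gauge"):
    """Write one metric in the Prometheus text format to <directory>/<name>.prom.

    Hypothetical helper: the name and layout are illustrative only.
    """
    body = (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name} {value}\n"
    )
    final_path = os.path.join(directory, f"{name}.prom")

    # Write to a temporary file first and rename it into place, so the
    # collector never reads a half-written file. The temporary file does
    # not end in .prom, so it is ignored while it is being written.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_path, final_path)
    return final_path
```

A cron job or a systemd timer could call this at the end of each run,
for example write_textfile_metric(textfile_dir, "myapp_jobs_processed",
42, "Jobs processed by the last run."), where textfile_dir is whatever
directory the collector is configured to read. This only covers values
that are fine to sample at scrape time, so it is not a drop-in
replacement for pushing arbitrary statsd events.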
Any questions or comments on the above?
Taavi