[Cloud-admin] Some upcoming changes to the Cloud VPS metrics stack

4 Jan 2023


Hi all,
There are a couple of major changes to our Cloud VPS observability stack 
that I'm planning to make in the near term. Most of this should be 
visible on Phabricator as well, but I wanted to make everyone aware here 
regardless, since following activity on Phabricator is hard and I don't 
want to cause any major surprises.
I'm sure some of this will have an effect on our users, and we will need 
to communicate it to them beforehand, but I'm not at that stage quite yet.
== 1. New Grafana instance: grafana.wmcloud.org ==
The first and hopefully least impactful change is replacing the current 
grafana-cloud.wikimedia.org (aka grafana-labs.wikimedia.org) Grafana 
instance with a new one. The reason is that the current one runs 
directly on hardware (cloudmetrics*.eqiad.wmnet), and due to upstream 
Grafana changes it soon won't be able to reach out to Prometheus 
instances living on cloud-vps VMs.
This work is tracked as T307465, and has patches up for review starting 
from https://gerrit.wikimedia.org/r/c/operations/puppet/+/869210/.
== 2. Diamond removal ==
The Prometheus instance in metricsinfra now scrapes all Cloud VPS VMs. 
This was the primary blocker for getting rid of Diamond (a Python 2 
program that collected node metrics and pushed them to Graphite). I hope 
that this transition will be mostly invisible to users, provided we 
migrate the most-used Grafana dashboard (cloud-vps-project-board) to 
pull its metrics from Prometheus instead.
This is tracked as T317032.
== 3. Statsd/Graphite removal (once Diamond is gone) ==
My understanding is that the statsd/Graphite service was not originally 
intended as a generic service for cloud-vps users (although it certainly 
is used like one today). Either way, we don't really have a good 
replacement for it, except in some limited cases that could use 
node-exporter text files instead. I'm not sure how big of a deal that is 
if we never claimed to support it anyway?
This is tracked as T326266.
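
For anyone wondering what the node-exporter text file route looks like in 
practice, here is a minimal sketch. It assumes node-exporter runs with its 
textfile collector pointed at some directory (the 
--collector.textfile.directory option); the helper name, the metric name 
and the directory are all made up for illustration and are not how anything 
is set up in Cloud VPS today. The file is written to a temporary name first 
and then renamed, so the collector never reads a half-written file.

```python
import os
import tempfile


def write_textfile_metric(directory, name, value, help_text, metric_type="gauge"):
    """Atomically write a single metric to <directory>/<name>.prom.

    Hypothetical helper: 'directory' is assumed to be whatever directory
    node-exporter's textfile collector has been configured to read.
    """
    # Prometheus text exposition format: HELP line, TYPE line, then the sample.
    body = (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name} {value}\n"
    )
    final_path = os.path.join(directory, f"{name}.prom")

    # The temporary file does not end in .prom, so the collector skips it.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(body)
        # mkstemp creates the file as 0600; make it readable by node-exporter.
        os.chmod(tmp_path, 0o644)
        # Rename within the same directory, replacing any previous version.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

A cron job or script that currently sends a gauge to statsd could call 
this at the end of each run instead, for example 
write_textfile_metric(textfile_dir, "myjob_last_success_timestamp_seconds", 
int(time.time()), "Unix time of the last successful run"). This only covers 
values that can be rewritten as a whole on each run; it does not replace 
statsd-style per-event counters or timers.
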
Any questions or comments on the above?
Taavi
