Cloud-admin February 2023

cloud-admin@lists.wikimedia.org

1 participant
1 discussion

Some upcoming changes to the Cloud VPS metrics stack
by Taavi Väänänen 07 Feb '23


Hi all,

There are a couple of major changes to our Cloud VPS o11y stack that I'm planning to make in the near term. Most of this should be visible on Phabricator as well, but I wanted to make everyone aware here regardless since following activity on Phabricator is hard and I don't want to cause any major surprises here. I'm sure some of this will have an effect on our users and I/we need to communicate it beforehand, but I'm not at that stage quite yet.

== 1. New Grafana instance: grafana.wmcloud.org ==

The first and hopefully least impactful change is replacing the current grafana-cloud.wikimedia.org (aka grafana-labs.wikimedia.org) Grafana instance with a new one. The reason is that the current one runs directly on hardware (cloudmetrics*.eqiad.wmnet), and due to upstream Grafana changes it soon won't be able to reach out to Prometheus instances living on cloud-vps VMs.

This work is tracked as T307465, and has patches up for review starting from https://gerrit.wikimedia.org/r/c/operations/puppet/+/869210/.

== 2. Diamond removal ==

The Prometheus instance in metricsinfra now scrapes all Cloud VPS VMs. This was the primary blocker for getting rid of Diamond (a Python 2 program that collected node metrics and pushed them to Graphite). I hope that this transition will be mostly invisible to users if we migrate the most used Grafana dashboard (cloud-vps-project-board) to pull the metrics from Prometheus instead.

This is tracked as T317032.

== 3. Statsd/Graphite removal (once Diamond is gone) ==

My understanding is that the statsd/Graphite service was originally not intended as a generic service for cloud-vps users (although it certainly is used like one today). Either way we don't really have a good replacement for it except some limited cases that could use node-exporter text files instead. I'm not sure how big of a deal that is if we never claimed to support it anyway?

This is tracked as T326266.

Any questions or comments on the above?

Taavi
