Hi all,
There are a couple of major changes to our Cloud VPS o11y stack that I'm planning to make in the near term. Most of this should be visible on Phabricator as well, but I wanted to make everyone aware here regardless since following activity on Phabricator is hard and I don't want to cause any major surprises here.
I'm sure some of this will have an effect on our users and I/we need to communicate it beforehand, but I'm not at that stage quite yet.
== 1. New Grafana instance: grafana.wmcloud.org ==
The first and hopefully least impactful change is replacing the current grafana-cloud.wikimedia.org (aka grafana-labs.wikimedia.org) Grafana instance with a new one. The reason is that the current one runs directly on hardware (cloudmetrics*.eqiad.wmnet), and due to upstream Grafana changes it soon won't be able to reach out to Prometheus instances living on cloud-vps VMs.
This work is tracked as T307465, and has patches up for review starting from https://gerrit.wikimedia.org/r/c/operations/puppet/+/869210/.
== 2. Diamond removal ==
The Prometheus instance in metricsinfra now scrapes all Cloud VPS VMs. This was the primary blocker for getting rid of Diamond (a Python 2 program that collected node metrics and pushed them to Graphite). I hope that this transition will be mostly invisible to users if we migrate the most used Grafana dashboard (cloud-vps-project-board) to pull the metrics from Prometheus instead.
This is tracked as T317032.
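For anyone who currently reads Diamond's node metrics out of Graphite, the rough Prometheus equivalent is a PromQL query against node-exporter metrics. The sketch below only builds the query string and the URL for the standard Prometheus HTTP API instant-query endpoint. The `project` label and the base URL are assumptions for illustration, not the actual metricsinfra labels or address.

```python
from urllib.parse import urlencode


def cpu_usage_query(project, window="5m"):
    """Build a PromQL expression for per-instance CPU usage (0..1).

    node_cpu_seconds_total is a standard node-exporter metric.
    The `project` label is an ASSUMPTION about how VMs are labelled.
    """
    return (
        "1 - avg by (instance) "
        f'(rate(node_cpu_seconds_total{{mode="idle",project="{project}"}}[{window}]))'
    )


def instant_query_url(base_url, promql):
    """Return the URL for Prometheus' /api/v1/query endpoint.

    base_url is a placeholder; substitute the real Prometheus address.
    """
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})
```

The same expression could be pasted into a Grafana panel that uses a Prometheus data source; only the label names would need adjusting to whatever the real setup exposes.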
== 3. Statsd/Graphite removal (once Diamond is gone) ==
My understanding is that the statsd/Graphite service was originally not intended as a generic service for cloud-vps users (although it certainly is used like one today). Either way we don't really have a good replacement for it except some limited cases that could use node-exporter text files instead. I'm not sure how big of a deal that is if we never claimed to support it anyway?
This is tracked as T326266.
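As an illustration of the "node-exporter text files" option mentioned above: node-exporter's textfile collector reads `*.prom` files in the Prometheus text exposition format from a configured directory. The following is a minimal sketch of how a cron job could publish a gauge that way. The metric name, label, and directory are hypothetical, and label values are assumed not to need escaping.

```python
import os
import tempfile


def render_metric(name, value, help_text, labels=None, metric_type="gauge"):
    """Render one sample in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        # Assumes label values contain no quotes, backslashes or newlines.
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name}{label_str} {value}\n"
    )


def write_textfile(directory, filename, content):
    """Atomically write `content` to directory/filename.

    The data goes to a temporary file first and is then renamed, so the
    collector never sees a half-written file. The temporary name does not
    end in .prom, so it is ignored while it exists.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # must be readable by the node-exporter user
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)
    return final_path
```

This only covers values that can be written out periodically from the VM itself. It does not replace the push-style counters and timers that statsd offers.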
Any questions or comments on the above?
Taavi
On Wed, Jan 4, 2023 at 12:41 PM Taavi Väänänen <hi@taavi.wtf> wrote:
> Hi all,
> There are a couple of major changes to our Cloud VPS o11y stack that I'm planning to make in the near term. Most of this should be visible on Phabricator as well, but I wanted to make everyone aware here regardless since following activity on Phabricator is hard and I don't want to cause any major surprises here.
Thank you for thinking about it this way. You are totally correct that I have seen things happening in Phab and Gerrit, but I didn't know where things were at on a more holistic level.
> I'm sure some of this will have an effect on our users and I/we need to communicate it beforehand, but I'm not at that stage quite yet.
> == 1. New Grafana instance: grafana.wmcloud.org ==
> The first and hopefully least impactful change is replacing the current grafana-cloud.wikimedia.org (aka grafana-labs.wikimedia.org) Grafana instance with a new one. The reason is that the current one runs directly on hardware (cloudmetrics*.eqiad.wmnet), and due to upstream Grafana changes it soon won't be able to reach out to Prometheus instances living on cloud-vps VMs.
> This work is tracked as T307465, and has patches up for review starting from https://gerrit.wikimedia.org/r/c/operations/puppet/+/869210/.
Can we leave an HTTP redirect running somewhere to make old links at least end up on grafana.wmcloud.org even if we can't guarantee that the dashboard they led to is still around? My fingers and Firefox awesomebar completion will thank you! :)
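A rough sketch of what such a redirect could look like, using only Python's standard library: requests for the old hostnames get a 301 to the same path and query string on grafana.wmcloud.org. In practice this would more likely be a rule in whatever web server or proxy fronts the old names, so treat the handler and helper names here as purely illustrative.

```python
from http.server import BaseHTTPRequestHandler

NEW_BASE = "https://grafana.wmcloud.org"
OLD_HOSTS = {"grafana-cloud.wikimedia.org", "grafana-labs.wikimedia.org"}


def redirect_target(host, path):
    """Map a request for an old hostname to the new instance.

    Returns the new URL with path and query string preserved,
    or None if the Host header is not one of the old names.
    """
    hostname = host.split(":", 1)[0].lower()  # drop any :port suffix
    if hostname not in OLD_HOSTS:
        return None
    if not path.startswith("/"):
        path = "/" + path
    return NEW_BASE + path


class RedirectHandler(BaseHTTPRequestHandler):
    """Answer every GET/HEAD with a permanent redirect (illustrative only)."""

    def do_GET(self):
        target = redirect_target(self.headers.get("Host", ""), self.path)
        if target is None:
            self.send_error(404)
            return
        self.send_response(301)
        self.send_header("Location", target)
        self.end_headers()

    do_HEAD = do_GET
```

Passing `RedirectHandler` to `http.server.HTTPServer` is enough to try it locally. Old dashboard links would then land on the new instance even if the specific dashboard no longer exists there.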
> == 2. Diamond removal ==
> The Prometheus instance in metricsinfra now scrapes all Cloud VPS VMs. This was the primary blocker for getting rid of Diamond (a Python 2 program that collected node metrics and pushed them to Graphite). I hope that this transition will be mostly invisible to users if we migrate the most used Grafana dashboard (cloud-vps-project-board) to pull the metrics from Prometheus instead.
> This is tracked as T317032.
The main thing I know that Diamond removal has affected is Timo's https://nagf.toolforge.org/ tool. I think the cloud-vps-project-board Grafana dashboard is a reasonable replacement for that tool that Timo built long ago to do pretty much the same job, but there are some places that we should replace links to NAGF with links to the new Grafana instead. Openstack-browser and [[wikitech:Template:Nova Resource]] are the ones I'm remembering at the moment.
> == 3. Statsd/Graphite removal (once Diamond is gone) ==
> My understanding is that the statsd/Graphite service was originally not intended as a generic service for cloud-vps users (although it certainly is used like one today). Either way we don't really have a good replacement for it except some limited cases that could use node-exporter text files instead. I'm not sure how big of a deal that is if we never claimed to support it anyway?
> This is tracked as T326266.
Deployment-prep's https://phabricator.wikimedia.org/T241285 is a thing, but maybe we don't actually have a strong use case for a replacement as Taavi has noted there in the past. Jean-Fred also uses it from Toolforge (https://phabricator.wikimedia.org/T325936). ORES used to use it too, but that may be all dead tech at this point as well. It is also probably time to close https://phabricator.wikimedia.org/T241284 as WONTFIX.
Bryan
I've started a news page on Wikitech for these changes [0]. If you have concerns about publicly announcing these or have feedback on any possible timelines, now is the time for that. Otherwise I am going to pick some dates and send an email to cloud-announce in the next few days.
[0]: https://wikitech.wikimedia.org/wiki/News/2023_Cloud_VPS_metrics_changes
Taavi
cloud-admin@lists.wikimedia.org