I think we could start monitoring prometheus-node-exporter on all Cloud VPS VMs on all projects via the Prometheus instance in metricsinfra. The required firewall rules are now in place (thanks to Andrew in T288108), and I've written the required patches to cloud/metricsinfra/prometheus-manager and to the Puppet repo:
https://gerrit.wikimedia.org/r/c/cloud/metricsinfra/prometheus-manager/+/856...
https://gerrit.wikimedia.org/r/c/operations/puppet/+/856917/
The main effect is that we (and project admins, of course) will have basic metrics (CPU, disk, RAM, and so on) for all instances in all projects. For now, these metrics won't trigger any alerts unless a metricsinfra admin configures them manually.
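To make "basic metrics" concrete, here is a minimal sketch of the kind of PromQL queries that node-exporter data supports, along with the Prometheus HTTP API URL used to run them. The metric names are the standard node-exporter ones; the server URL and the `project` label name are assumptions made for illustration and are not taken from the patches above.

```python
from urllib.parse import urlencode

# Hypothetical values for illustration only; the real metricsinfra
# endpoint and label names may differ.
PROMETHEUS_URL = "https://prometheus.example.org"
PROJECT_LABEL = "project"


def node_queries(project: str) -> dict[str, str]:
    """Build PromQL expressions for CPU, memory and disk for one project."""
    sel = f'{{{PROJECT_LABEL}="{project}"}}'
    idle = f'{{{PROJECT_LABEL}="{project}",mode="idle"}}'
    return {
        # Share of CPU time not spent idle, averaged over cores, per instance.
        "cpu_busy_percent": (
            "100 * (1 - avg by (instance) "
            f"(rate(node_cpu_seconds_total{idle}[5m])))"
        ),
        # Fraction of RAM still available.
        "memory_available_ratio": (
            f"node_memory_MemAvailable_bytes{sel}"
            f" / node_memory_MemTotal_bytes{sel}"
        ),
        # Fraction of each filesystem still available.
        "disk_available_ratio": (
            f"node_filesystem_avail_bytes{sel}"
            f" / node_filesystem_size_bytes{sel}"
        ),
    }


def instant_query_url(expr: str) -> str:
    """Return the URL for an instant query against the Prometheus HTTP API."""
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    for name, expr in node_queries("tools").items():
        print(name, instant_query_url(expr))
```

The same expressions could also serve as a starting point for alert rules, if an admin decides to configure any.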
Please let me know if you have any questions or concerns; otherwise, I'd like to move forward in the next few days.
Taavi
Thanks for this! I'm both excited and dismayed to see an extra 150 alerts on my dashboard this morning :)
On 11/15/22 3:07 AM, Taavi Väänänen wrote:
[...]
Cloud-admin mailing list -- cloud-admin@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/cloud-admin.lists.wikimedia.org/