On Tue, Dec 15, 2020 at 11:25 AM Roy Smith roy@panix.com wrote:
Thanks. Backing up a step, what I'm looking to do is build some kind of performance and monitoring dashboard for my tool. From what you say, maybe Thanos is not the right thing for that?
Thanos is an aggregating data store for the Prometheus metrics that we collect in the Wikimedia production network. We do not ship any metrics for Cloud VPS or Toolforge into that environment.
We have some metrics available for Toolforge tools, but not as many as we would like. The best monitoring we have currently is for the Toolforge Kubernetes cluster and the workloads that run there. The k8s-status tool shows read-only information about the Toolforge Kubernetes cluster. At https://k8s-status.toolforge.org/namespaces/tool-slow-parse/ you can see information about Roy's slow-parse tool. From there you can follow the 'Grafana dashboard' link to a Grafana dashboard that shows collected metrics about the Pods that have run in the slow-parse tool's Kubernetes namespace.
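For other tools, the same page can probably be reached by swapping the tool name into that URL. A minimal sketch in Python, assuming every tool's namespace follows the `tool-<name>` pattern seen in the slow-parse example (the helper name `k8s_status_url` is made up for illustration):

```python
from urllib.parse import quote

# Base URL taken from the k8s-status link above.
K8S_STATUS_BASE = "https://k8s-status.toolforge.org"


def k8s_status_url(tool_name: str) -> str:
    """Build the k8s-status namespace page URL for a Toolforge tool.

    Assumption: the Kubernetes namespace is "tool-" plus the tool name,
    as in the slow-parse example.
    """
    namespace = f"tool-{tool_name}"
    # Percent-encode the namespace so unexpected characters cannot break the path.
    return f"{K8S_STATUS_BASE}/namespaces/{quote(namespace, safe='')}/"


print(k8s_status_url("slow-parse"))
```

The 'Grafana dashboard' link is still followed from the resulting page; this sketch does not try to guess the Grafana URL.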
Someday™ we will make time to build out more monitoring for both Toolforge tools and Cloud VPS tenants. There are several Phabricator tasks in the extended backlog with wishes that folks have made about such things. https://phabricator.wikimedia.org/T194333 is one that has some really high-level ideas on it and some more concrete subtasks.
Bryan