I want to build something to monitor the [[WP:DYK]] system on enwiki. I want to look at the length of various queues: nominations, approved nominations, number of hook sets ready for publication, perhaps a few more. Update times will be perhaps as low as once per day, certainly no faster than once per hour. Initially all I want to do is graph these. Eventually I might want to do some alerting.
In the old days, I would just have a simple script that threw some numbers at statsd. Looking at https://wikitech.wikimedia.org/wiki/Prometheus, it looks like that translates into using the pushgateway, but it's far from clear what I need to do to set this up. The docs talk about Puppet and certificates. Can somebody walk me through the setup?
Are the statistics you want to monitor available now over https queries? In logs somewhere?
Basically with Prometheus you start with “there exists a metric/statistic that can be queried”, then you start having Prometheus query it by adding it to the configuration.
However, it’s really a Prometheus Problem when you’re talking about updating everything every minute or second, usually when checking across many or all hosts. A daily or hourly query in Prometheus across a whole wiki, not on each host, is really kind of silly. It can do that, sure, but maybe you just use a script instead?
If the standard now is Prometheus Everything then I guess that’s ok, but figuring out where to source the metrics is the next problem. Implementation after that is fairly easy…
-george
On May 1, 2025, at 6:40 AM, Roy Smith roy@panix.com wrote:
I want to build something to monitor the [[WP:DYK]] system on enwiki. I want to look at the length of various queues: nominations, approved nominations, number of hook sets ready for publication, perhaps a few more. Update times will be perhaps as low as once per day, certainly no faster than once per hour. Initially all I want to do is graph these. Eventually I might want to do some alerting.
In the old days, I would just have a simple script that threw some numbers as statsd. Looking at https://wikitech.wikimedia.org/wiki/Prometheus, it looks like that translates into using the pushgateway, but it's far from clear what I need to do to set this up. The docs talk about puppet, and certificates. Can somebody walk me through the setup?
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
On May 1, 2025, at 12:54 PM, George Herbert george.herbert@gmail.com wrote:
Are the statistics you want to monitor available now over https queries? In logs somewhere?
No to both of those.
Basically with Prometheus you start with “there exists a metric/statistic that can be queried”, then you start having Prometheus query it by adding it to the configuration.
However, it’s really a Prometheus Problem when you’re talking about updating everything every minute or second, usually when checking across many or all hosts. A daily or hourly query in Prometheus across a whole wiki not on each host is really kind of silly. It can do that, sure, but maybe you do just use a script instead?
I want a graph vs. time, which is what statsd/graphite was good at, so I assumed Prometheus would also be good at it. Why is this silly?
Il 01/05/25 20:17, Roy Smith ha scritto:
I want a graph vs time.. Which is what statsd/graphite was good at, so I assumed Prometheus would also be good at it. Why is this silly?
It's not silly at all! If you use standard Prometheus metrics and some labels, you can later also get some basic statistical analysis for free on Grafana.
What you described is called a Prometheus exporter. It would take the raw data (from the MediaWiki API?) and output the metrics in Prometheus format. You can hand-craft the metrics even in bash, but probably something like Python or Rust where you have both MediaWiki and Prometheus libraries will be easiest.
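As a rough illustration of "output the metrics in Prometheus format", here is a minimal standard-library sketch that renders queue lengths as one gauge with a `queue` label. The metric name `dyk_queue_length`, the label name, and the sample numbers are all made up for the example; nothing in this thread fixes them, and a real script would more likely use a Prometheus client library.

```python
def render_queue_gauges(queue_lengths, metric="dyk_queue_length"):
    """Render {queue_name: length} as Prometheus text-format gauge lines.

    The metric and label names are hypothetical, chosen only for illustration.
    """
    lines = [
        f"# HELP {metric} Number of items currently in each DYK queue.",
        f"# TYPE {metric} gauge",
    ]
    for queue, length in sorted(queue_lengths.items()):
        # Label values must escape backslash, double quote, and newline.
        escaped = (
            queue.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        lines.append(f'{metric}{{queue="{escaped}"}} {length}')
    # The text format expects every line, including the last, to end in a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Invented sample values, not real DYK numbers.
    print(render_queue_gauges({"nominations": 120, "approved": 45}), end="")
```

Using one metric with a label, rather than one metric name per queue, is what lets Grafana aggregate or compare the queues later.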
The pushgateway is the traditional solution for a batch job like this. I don't know how authentication etc. is handled in WMF though.
The metrics you described are mostly gauges. For things like the time spent sitting in queues, you may want a histogram (so you can calculate e.g. the 75th percentile or the longest-waiting proposal). This is definitely best done with a Prometheus library (but make sure to manually set the buckets to some reasonable intervals, probably in terms of hours and days, otherwise you might get some unhelpful defaults starting from ms).
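To make the point about bucket boundaries concrete, here is a small standard-library sketch of how waiting times would fall into cumulative histogram buckets measured in hours and days. The bucket bounds and the sample ages are assumptions picked for illustration only. A client library does this bookkeeping itself; in the Python `prometheus_client` library, for instance, `Histogram` takes a `buckets` argument for the upper bounds.

```python
import bisect

HOUR = 3600
DAY = 24 * HOUR

# Hypothetical upper bounds in seconds; tune these to the real queue behaviour.
BUCKETS = [6 * HOUR, DAY, 3 * DAY, 7 * DAY, 14 * DAY, 30 * DAY]


def cumulative_buckets(ages_seconds, bounds=BUCKETS):
    """Return [(le, count)] pairs, where count is how many ages are <= le.

    Prometheus histogram buckets are cumulative, and the final "+Inf"
    bucket always equals the total number of observations.
    """
    ordered = sorted(ages_seconds)
    result = [(str(bound), bisect.bisect_right(ordered, bound)) for bound in bounds]
    result.append(("+Inf", len(ordered)))
    return result


if __name__ == "__main__":
    # Invented waiting times: 1 hour, 2 days, 10 days, 45 days.
    for le, count in cumulative_buckets([HOUR, 2 * DAY, 10 * DAY, 45 * DAY]):
        print(f'le="{le}" {count}')
```

With bounds like these, a nomination that has waited 45 days only shows up in the `+Inf` bucket, so the largest finite bound should sit above the longest wait you expect to care about.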
https://www.robustperception.io/how-does-a-prometheus-histogram-work/ https://prometheus.io/docs/practices/histograms/
Best, Federico
Thanks for the input. Yes, in the statsd world, these are what I would have called gauges. Histograms might be nice, but to get started, just the raw gauges will be a useful improvement over what we have now, so I figure I'd start with that. And, yes, I expect I'll implement this in some Python scripts launched by cron under the Toolforge jobs framework.
So, I guess if I wanted to do this on the command line, I would do:
echo "some_metric 3.14" | curl --data-binary @- http://prometheus-pushgateway.discovery.wmnet/???
where the ??? is the name of my job. Do I just make up something that looks reasonable, or is there some namespace that I get allocated for my metrics?
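For reference, the upstream Pushgateway accepts pushes at a path of the form `/metrics/job/<job name>`, so a Python equivalent of the curl command might look like the sketch below. The gateway URL and the job name `dyk_queues` are placeholders, and the sketch assumes the job name contains no `/`. It says nothing about whether a gateway is reachable from Cloud VPS or Toolforge, or about how job names are allocated.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def build_push_request(gateway, job, body):
    """Build a POST request for a Pushgateway at /metrics/job/<job>.

    `gateway` and `job` are supplied by the caller; the values used
    below are placeholders, not real endpoints.
    """
    if not body.endswith("\n"):
        body += "\n"  # the text format requires a trailing newline
    url = f"{gateway.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return Request(
        url,
        data=body.encode("utf-8"),
        method="POST",  # like curl --data-binary; PUT would replace the whole group
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(gateway, job, body, timeout=10):
    """Send the metrics and return the HTTP status code."""
    with urlopen(build_push_request(gateway, job, body), timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    req = build_push_request("http://pushgateway.example", "dyk_queues", "some_metric 3.14")
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from `push` makes it possible to check the URL and body without any network access.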
On May 4, 2025, at 3:07 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Il 01/05/25 20:17, Roy Smith ha scritto:
I want a graph vs time.. Which is what statsd/graphite was good at, so I assumed Prometheus would also be good at it. Why is this silly?
It's not silly at all! If you use standard Prometheus metrics and some labels, you can later also get some basic statistical analysis for free on Grafana.
What you described is called a Prometheus exporter. It would take the raw data (from the MediaWiki API?) and output the metrics in Prometheus format. You can hand-craft the metrics even in bash, but probably something like Python or Rust where you have both MediaWiki and Prometheus libraries will be easiest.
The pushgateway is the traditional solution for a batch job like this. I don't know how authentication etc. is handled in WMF though.
The metrics you described are mostly gauges. For things like the time spent sitting in queues, you may want a histogram (so you can calculate e.g. the 75th percentile or the longest-waiting proposal). This is definitely best done with a Prometheus library (but make sure to manually set the buckets to some reasonable intervals, probably in terms of hours and days, otherwise you might get some unhelpful defaults starting from ms).
https://www.robustperception.io/how-does-a-prometheus-histogram-work/ https://prometheus.io/docs/practices/histograms/
Best, Federico
It looks like prometheus-pushgateway.discovery.wmnet (as documented in https://wikitech.wikimedia.org/wiki/Prometheus#Ephemeral_jobs_(Pushgateway)) is not reachable from my VPS instance:
$ traceroute prometheus-pushgateway.discovery.wmnet
traceroute to prometheus-pushgateway.discovery.wmnet (10.64.0.82), 30 hops max, 60 byte packets
 1  vlan-legacy.cloudinstances2b-gw.svc.eqiad1.wikimedia.cloud (172.16.0.1)  0.657 ms  0.632 ms  0.563 ms
 2  vlan1107.cloudgw1004.eqiad1.wikimediacloud.org (185.15.56.234)  0.513 ms  0.486 ms  0.440 ms
 3  * * *
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *
Is that the correct host to be using?
On May 4, 2025, at 5:39 PM, Roy Smith roy@panix.com wrote:
Thanks for the input. Yes, in the statsd world, these are what I would have called gauges. HIstograms might be nice, but to get started, just the raw gauges will be a useful improvement over what we have now, so I figure I'd start with that. And, yes, I expect I'll implement this in some python scripts launched by cron under the toolforge jobs framework.
So, I guess if I wanted to do this on the command line, I would do:
echo "some_metric 3.14" | curl --data-binary @- http://prometheus-pushgateway.discovery.wmnet/???
where the ??? is the name of my job. Do I just make up something that looks reasonable, or is there some namespace that I get allocated for my metrics?
Ignoring a few special cases, the main namespace of Wikitech is used to document the "production" environment that hosts the Wikimedia project sites, and in general those services are not available for Cloud VPS/Toolforge tenants. Features usable in WMCS are generally documented in the Help: namespace.
Taavi