It seems that something in our logging infrastructure stopped working as of November 29th. Does anyone know what happened?
Debugging:
Looking at the graphs[1] reveals that all data points plummeted straight down that day and haven't shown any activity since. The one change came on December 4th, when even the unrealistic value of "0" stopped being echoed. After that, the line is absent entirely.
Example: http://graphite.wikimedia.org/render/?title=addOnloadHook&width=900&...
Raw count: http://graphite.wikimedia.org/render/?width=900&height=300&target=mw...
The software stack:
* graphite:statsd:
  - https://github.com/wikimedia/operations-puppet/commits/3a1134921/modules/txs...
  - https://github.com/wikimedia/operations-puppet/commits/3a1134921/modules/web...
* EventLogging: https://meta.wikimedia.org/wiki/Schema:DeprecatedUsage
* WikimediaEvents: https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/blob/34e42...
* mw.log.deprecate: https://github.com/wikimedia/mediawiki/blob/c6131c5df4/resources/src/mediawi...
Nothing there stood out to me.
When connecting to tcp://vanadium.eqiad.wmnet:8600 via zmq from tin.eqiad.wmnet directly[2], there's still a steady flow of incoming data via EventLogging. So it must've stagnated somewhere further down the line (on servers I don't have access to).
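For anyone wanting to repeat that check, here is a minimal sketch of how the stream could be sampled. It assumes pyzmq is installed and that each message is a JSON object carrying a "schema" field; the function names `count_schemas` and `read_events` are made up for this example and are not part of EventLogging.

```python
import json
from collections import Counter

# Endpoint taken from the message above.
ENDPOINT = "tcp://vanadium.eqiad.wmnet:8600"


def count_schemas(lines):
    """Count events per schema, skipping anything that is not a JSON object."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # not valid JSON; ignore it
        if isinstance(event, dict):
            # Assumption: the schema name lives under the "schema" key.
            counts[event.get("schema", "<unknown>")] += 1
    return counts


def read_events(endpoint=ENDPOINT, limit=100, timeout_ms=5000):
    """Yield up to `limit` raw messages; stop early after `timeout_ms` of silence."""
    import zmq  # pyzmq, imported lazily so count_schemas works without it

    context = zmq.Context.instance()
    sock = context.socket(zmq.SUB)
    sock.setsockopt_string(zmq.SUBSCRIBE, "")  # empty prefix = all messages
    sock.connect(endpoint)
    try:
        for _ in range(limit):
            if not sock.poll(timeout_ms):
                break  # nothing arrived within the timeout
            yield sock.recv_string()
    finally:
        sock.close(linger=0)
```

Running `print(count_schemas(read_events()))` from a host that can reach the endpoint should show whether DeprecatedUsage events are still arriving; an empty result would suggest the stream itself is quiet.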
Best, — Timo
[1] mw.js.deprecate graphs:
    last 5 months: http://codepen.io/Krinkle/full/zyodJ/
    last 3 weeks: http://codepen.io/Krinkle/full/cBGCl/
Hi Timo,
[ adding Ori to CC, since I am not sure whether or not he is on this list, and it seems he knows more about webperf ]
On Sat, Dec 20, 2014 at 05:25:15AM +0000, Timo Tijhof wrote:
Looking at the graphs [...] reveals that all data points plummeted straight down that day and haven't shown any activity since.
It seems the 'statsd-mw-js-deprecate' webperf service on hafnium did not properly detect (or recover from) a restart of EventLogging's zmq [1].
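To illustrate the kind of guard that might prevent such a silent stall, here is a rough sketch of a subscriber that reconnects after a period of silence. This is not the code of the actual service; `StallDetector`, `consume_forever`, and the 60-second threshold are hypothetical, and it assumes pyzmq is available.

```python
import time


class StallDetector:
    """Tracks when the last message arrived and reports prolonged silence."""

    def __init__(self, max_silence_s, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self.clock = clock
        self.last_seen = clock()

    def mark_message(self):
        """Record that a message (or a fresh connection) was just seen."""
        self.last_seen = self.clock()

    def is_stalled(self):
        """True once more than max_silence_s has passed since the last mark."""
        return self.clock() - self.last_seen > self.max_silence_s


def consume_forever(endpoint, handle, max_silence_s=60, poll_ms=1000):
    """Pass each message to `handle`; reconnect when the stream goes quiet."""
    import zmq  # pyzmq, imported lazily

    context = zmq.Context.instance()

    def connect():
        sock = context.socket(zmq.SUB)
        sock.setsockopt_string(zmq.SUBSCRIBE, "")
        sock.connect(endpoint)
        return sock

    sock = connect()
    detector = StallDetector(max_silence_s)
    while True:
        if sock.poll(poll_ms):
            handle(sock.recv_string())
            detector.mark_message()
        elif detector.is_stalled():
            # No data for too long: drop the socket and start over.
            sock.close(linger=0)
            sock = connect()
            detector.mark_message()  # restart the silence timer
```

The silence threshold would need tuning to the real event rate, since a genuinely quiet schema would otherwise trigger needless reconnects.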
Not sure who maintains the 'statsd-mw-js-deprecate' webperf service [2].
But I've been bold [3], and restarted it, and it seems the service is now working again, and producing the needed numbers:
https://graphite.wikimedia.org/render/?width=640&height=480&_salt=14...
Have fun, Christian
[1] https://graphite.wikimedia.org/render/?width=640&height=480&_salt=14...
[2] It's not EventLogging. The code is directly in puppet. Most webperf commits came from Ori, but this Python file seems to have been written by you.
[3] I could find neither you nor Ori on IRC, the service was not doing its job anyway, and from looking at the code, it seemed safe to restart.
Hi,
On Sat, Dec 20, 2014 at 11:01:00PM +0100, Christian Aistleitner wrote:
[ webperf scripts ]
Just in case anyone needs to debug such issues ad hoc again, I started
https://wikitech.wikimedia.org/wiki/Webperf
Have fun, Christian