Yesterday morning I received an alert that the gadolinium udp2log process was experiencing
packet loss. In addition to being the webstats-collector host (which generates the
pagecounts files), gadolinium is a socat relay. It is responsible for feeding all of the
webrequest log traffic to about 5 udp2log instances in total.
Upon investigating the packet loss issue on gadolinium, I noticed that the socat relay
process itself was dropping packets whenever the udp2log process was also up. I believe this is
because when both socat and udp2log are running, the NIC must process twice as much
data as when only one is running. I went into emergency mode to move as many of
the udp2log filters as possible to other existing udp2log boxes. Opsen and I set up a new box
(erbium) so that we could still have a box on which to run some of the gadolinium udp2log
filters (including the webstatscollector one).
Fundraising gets their webrequest data from gadolinium, so I spent much of the day
working with them. It turned out that this wasn't much of an emergency for them,
since they had a scheduled downtime during this time anyway.
Erbium was almost fully ready yesterday evening. When I was about to finish setting up
erbium, other opsen had started restructuring the production puppetmaster setup, which
caused puppet to stop working for a short period. I was pressed for time to finish this,
but couldn't until the puppetmaster was back up. I had urgent personal business to
take care of (had to put an application in on an apartment before someone else did), so I
ran out for the evening, leaving things in this state. I was thinking mostly of
Fundraising, and they didn't seem worried, and I forgot that webstatscollector was
an issue too.
Erbium is online as of a few minutes ago and the webstatscollector processes should be
trucking along, so pageview data should be fine starting now. The webstatscollector
processes are not currently monitored. I plan to add process monitoring for both of
these, as well as UDP dropped packet statistics for both the socat relay process and the
webstatscollector process.
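As a rough illustration of the dropped-packet statistics mentioned above, here is a minimal
sketch that reads the kernel's UDP counters from /proc/net/snmp on Linux and reports how much
InErrors and RcvbufErrors grew between two samples. This is not the monitoring that is
deployed; the function names are hypothetical, and these counters are host-wide rather than
per-process (per-socket drop counts live in the "drops" column of /proc/net/udp).

```python
def parse_udp_counters(snmp_text):
    """Parse the two 'Udp:' lines (header + values) of /proc/net/snmp-style text."""
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]  # does not match 'UdpLite:'
    if len(udp_lines) < 2:
        raise ValueError("no Udp header/value pair found")
    header, values = udp_lines[0], udp_lines[1]
    # Skip the leading 'Udp:' token on both lines and pair names with values.
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def drop_delta(before, after, fields=("InErrors", "RcvbufErrors")):
    """Return how much each error counter grew between two samples."""
    return {f: after[f] - before[f] for f in fields if f in before and f in after}


def read_counters(path="/proc/net/snmp"):
    """Read and parse the live counters (Linux only; path is an assumption)."""
    with open(path) as fh:
        return parse_udp_counters(fh.read())
```

Sampling read_counters() twice, some seconds apart, and passing both results to drop_delta()
gives the number of datagrams the host failed to deliver in that interval; a non-zero
RcvbufErrors delta usually means a receiver is not draining its socket buffer fast enough.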
On Jul 24, 2013, at 1:37 AM, Jeremy Baron <jeremy(a)tuxmachine.com> wrote:
On Jul 24, 2013 12:43 AM, "Ikuya Yamada" <ikuya(a)sfc.keio.ac.jp> wrote:
It seems that the page view statistics data does not contain the actual data
for the last few hours.
http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-07/
Are there any failures on the server-side?
Just looking at file sizes I can see 15, 16, and 20-05 (the current hour) UTC all look
smaller than normal. (yes, something's broken)
-Jeremy
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics