Yesterday morning I received an alert that the gadolinium udp2log process was experiencing
packet loss. In addition to being the webstats-collector host (which generates the
pagecounts files), gadolinium is a socat relay. It is responsible for feeding all of the
webrequest log traffic to about 5 udp2log instances.
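For readers unfamiliar with this kind of relay, here is a minimal Python sketch of UDP fan-out. It is not the socat configuration used on gadolinium; the function name `relay_once` and its parameters are made up for illustration. The point it shows is that every datagram received once is sent again once per downstream instance.

```python
import socket


def relay_once(listen_sock, out_sock, destinations, bufsize=65535):
    """Receive one datagram and forward a copy to every destination.

    Hypothetical helper for illustration only; the real relay is socat.
    Returns the total number of payload bytes sent out.
    """
    data, _addr = listen_sock.recvfrom(bufsize)
    for dest in destinations:
        # One outbound copy per downstream udp2log-style consumer.
        out_sock.sendto(data, dest)
    return len(data) * len(destinations)
```

With 5 destinations, each inbound log line becomes 5 outbound datagrams, in addition to whatever a local consumer on the same host reads.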
Upon investigating the packet loss issue on gadolinium, I noticed that the socat relay
process itself was dropping packets if the udp2log process was also up. I believe this is
because, when both socat and udp2log are running, the NIC must process twice as much
data as when only one is running. I went into emergency mode to move as many of
the udp2log filters as possible to other existing udp2log boxes. Opsen and I set up a new box
(erbium) so that we could still have a box on which to run some of the gadolinium udp2log
filters (including the webstatscollector one).
Fundraising gets their webrequest data from gadolinium, so I spent much of the day
working with them. It turned out that this wasn't much of an emergency for them,
since they had a scheduled downtime during this time anyway.
Erbium was almost fully ready yesterday evening. When I was about to finish setting up
erbium, other opsen started restructuring the production puppetmaster setup, which
caused puppet to not work for a short period. I was pressed for time to finish this,
but couldn't until the puppetmaster was back up. I had urgent personal business to
take care of (had to put an application in on an apartment before someone else did), so I
ran out for the evening, leaving things in this state. I was thinking mostly of
Fundraising, and they didn't seem worried, and I forgot that webstatscollector was
an issue too.
Erbium is online as of a few minutes ago and the webstatscollector processes should be
trucking along, so pageview data should be fine starting now. The webstatscollector
processes are not currently monitored. I plan to add process monitoring for both of
these, as well as UDP dropped packet statistics for both the socat relay process and the
On Jul 24, 2013, at 1:37 AM, Jeremy Baron <jeremy(a)tuxmachine.com> wrote:
On Jul 24, 2013 12:43 AM, "Ikuya Yamada"
It seems that the page view statistics data does not contain the actual data for the last
few hours.
Are there any failures on the server-side?
Just looking at file sizes I can see 15, 16, and 20-05 (the current hour) UTC all look
smaller than normal. (yes, something's broken)
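The file-size check described above can be roughly automated. The following Python sketch flags files whose size falls well below the median. The function name `undersized_files`, the 0.5 threshold, and the example sizes are all assumptions chosen for illustration, not values from the actual pagecounts files.

```python
from statistics import median


def undersized_files(sizes_by_name, threshold=0.5):
    """Return names whose size is below threshold * median size, sorted by name.

    sizes_by_name maps a file name to its size in bytes. The threshold is
    an arbitrary assumption and would need tuning against real data.
    """
    if not sizes_by_name:
        return []
    typical = median(sizes_by_name.values())
    return sorted(
        name for name, size in sizes_by_name.items()
        if size < threshold * typical
    )


# Made-up sizes: hours 15 and 16 are much smaller than their neighbours.
example = {"h14": 100, "h15": 30, "h16": 25, "h17": 95, "h18": 105}
suspicious = undersized_files(example)
```

In practice the sizes could be gathered with `os.path.getsize` over the hourly files. The median is used instead of the mean so that a few truncated files do not drag the baseline down.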
Analytics mailing list