Yesterday morning I received an alert that the gadolinium udp2log process was experiencing packet loss.  In addition to being the webstats-collector host (which generates the pagecounts files), gadolinium is a socat relay.  It is responsible for feeding all of the webrequest log traffic to about 5 udp2log instances.

Upon investigating the packet loss issue on gadolinium, I noticed that the socat relay process itself was dropping packets whenever the udp2log process was also up.  I believe this is because when both socat and udp2log are running, the NIC must process twice as much data as when only one is running.  I went into emergency mode to move as many of the udp2log filters as possible to other existing udp2log boxes.  Opsen and I set up a new box (erbium) so that we would still have a box on which to run some of the gadolinium udp2log filters (including the webstatscollector one).

Fundraising gets their webrequest data from gadolinium, so I spent much of the day working with them.  It turned out that this wasn't much of an emergency for them, since they had a scheduled downtime during this time anyway.

Erbium was almost fully ready yesterday evening.  As I was about to finish setting it up, other opsen started restructuring the production puppetmaster setup, which caused puppet to stop working for a short period.  I was pressed for time, but couldn't finish until the puppetmaster was back up.  I had urgent personal business to take care of (I had to put in an application on an apartment before someone else did), so I left for the evening with things in this state.  I was thinking mostly of Fundraising, who didn't seem worried, and I forgot that webstatscollector was an issue too.

Erbium is online as of a few minutes ago and the webstatscollector processes should be trucking along, so pageview data should be fine starting now.  The webstatscollector processes are not currently monitored.  I plan to add process monitoring for both of these, as well as UDP dropped packet statistics for both the socat relay process and the webstatscollector process.
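
As a rough illustration of the kind of UDP drop statistics mentioned above, here is a minimal Python sketch.  It assumes a Linux host where /proc/net/udp lists one socket per line with the drop counter in the last column; the function name, the sample text, and the port numbers are made up for illustration and are not what is deployed on gadolinium or erbium.

```python
def parse_udp_drops(proc_net_udp_text):
    """Return {local_port: dropped_packets} from /proc/net/udp-style text.

    Assumes the Linux layout: a header line, then one line per socket whose
    second field is hex "address:port" and whose last field is the drop count.
    """
    drops_by_port = {}
    lines = proc_net_udp_text.strip().splitlines()
    for line in lines[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 13:
            continue  # not a socket line in the expected layout
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        drops = int(fields[-1])
        # Several sockets can share a port, so sum their counters.
        drops_by_port[local_port] = drops_by_port.get(local_port, 0) + drops
    return drops_by_port


# Hypothetical sample input; the ports and counts are illustrative only.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
  10: 00000000:20F4 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 11111 2 0000000000000000 1523
  11: 00000000:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 22222 2 0000000000000000 0
"""

if __name__ == "__main__":
    for port, drops in sorted(parse_udp_drops(SAMPLE).items()):
        print(f"port {port}: {drops} dropped")
```

On a real host the text would come from reading /proc/net/udp, and a monitoring check would alert when a counter grows between two samples rather than on its absolute value.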

> It seems that the page view statistics data does not contain the
> actual data for the last few hours.
> Are there any failures on the server-side?

Just looking at file sizes, I can see that 15, 16, and 20-05 (the current hour) UTC all look smaller than normal.  (Yes, something's broken.)
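
The same eyeball check on file sizes could be automated along these lines.  This is only a sketch: the 0.8 ratio, the function name, and the file names and sizes in the example are all assumptions picked for illustration, not real pagecounts data.

```python
from statistics import median


def flag_small_files(sizes_by_name, ratio=0.8):
    """Return the names whose size is below ratio * median size, sorted."""
    typical = median(sizes_by_name.values())
    return sorted(
        name for name, size in sizes_by_name.items() if size < ratio * typical
    )


if __name__ == "__main__":
    # Hypothetical hourly file sizes in MB.
    example = {
        "hour-13": 100,
        "hour-14": 102,
        "hour-15": 60,
        "hour-16": 55,
        "hour-17": 98,
    }
    print(flag_small_files(example))
```

Comparing against the median rather than the mean keeps a few unusually small hours from dragging down the baseline they are compared to.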


