Yesterday morning I received an alert that the gadolinium udp2log process was experiencing packet loss. In addition to being the webstats-collector host (which generates the pagecounts files), gadolinium is a socat relay. It is responsible for feeding all of the webrequest log traffic to about 5 udp2log instances.

Upon investigating the packet loss issue on gadolinium, I noticed that the socat relay process itself was dropping packets whenever the udp2log process was also up. I believe this is because, when both socat and udp2log are running, the NIC must process twice as much data as when only one is running. I went into emergency mode to move as many of the udp2log filters as possible to other existing udp2log boxes. Opsen and I set up a new box (erbium) so that we could still have a box on which to run some of the gadolinium udp2log filters (including the webstatscollector one).

Fundraising gets their webrequest data from gadolinium, so I spent much of the day working with them. It turned out that this wasn't much of an emergency for them, since they had a scheduled downtime during this time anyway.

Erbium was almost fully ready yesterday evening. When I was about to finish setting up erbium, other opsen had started restructuring the production puppetmaster setup, which caused puppet to not work for a short period. I was pressed for time to finish this, but couldn't until the puppetmaster was back up. I had urgent personal business to take care of (had to put an application in on an apartment before someone else did), so I ran out for the evening, leaving things in this state. I was thinking mostly of Fundraising, and they didn't seem worried, so I forgot that webstatscollector was an issue too.

Erbium is online as of a few minutes ago and the webstatscollector processes should be trucking along, so pageview data should be fine starting now. The webstatscollector processes are not currently monitored. I plan to add process monitoring for both of these, as well as UDP dropped packet statistics for both the socat relay process and the webstatscollector process.
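
For the dropped packet statistics, here is a rough sketch of the kind of check I have in mind, not what will actually be deployed. It assumes a Linux host where /proc/net/snmp exposes the kernel's UDP counters; the function names are made up for illustration. Note that these counters are host-wide, so separating socat from webstatscollector would need per-socket numbers (for example the drops column in /proc/net/udp).

```python
def parse_udp_counters(snmp_text):
    """Return the 'Udp:' counters from /proc/net/snmp-style text as a dict."""
    header = None
    for line in snmp_text.splitlines():
        # "UdpLite:" lines do not match this prefix, so only plain UDP is read.
        if not line.startswith("Udp:"):
            continue
        fields = line.split()[1:]
        if header is None:
            header = fields          # first Udp: line holds the column names
        else:
            return dict(zip(header, (int(v) for v in fields)))
    raise ValueError("no Udp section found")


def read_udp_counters(path="/proc/net/snmp"):
    """Read and parse the UDP counters from the given file (Linux only)."""
    with open(path) as f:
        return parse_udp_counters(f.read())


def drop_delta(before, after):
    """Difference in receive error counters between two samples."""
    # InErrors: datagrams that could not be delivered;
    # RcvbufErrors: datagrams dropped because a socket receive buffer was full.
    return {k: after[k] - before[k] for k in ("InErrors", "RcvbufErrors")}
```

A monitoring check would call read_udp_counters() twice, some interval apart, and alert when drop_delta() reports a nonzero increase.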

On Jul 24, 2013, at 1:37 AM, Jeremy Baron <jeremy@tuxmachine.com> wrote:

On Jul 24, 2013 12:43 AM, "Ikuya Yamada" <ikuya@sfc.keio.ac.jp> wrote:
> It seems that the page view statistics data does not contain the
> actual data for the last few hours.
>
> http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-07/
>
> Are there any failures on the server-side?
Just looking at file sizes I can see 15, 16, and 20-05(the current hour) UTC all look smaller than normal. (yes, something's broken)
-Jeremy
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics