Re: [Analytics] Monthly data loss on udp2log quantified in new report

29 Jul 2013

Thanks for putting this together Erik.

On Mon, Jul 29, 2013 at 8:29 AM, Erik Zachte ezachte@wikimedia.org wrote:
...
Thanks Andrew,

...
However, these metrics don't weight anything, so if there is any loss
from a role that has very few requests, the average will be skewed.

Yes, that is precisely what I thought was missing. So what my report adds
is the bottom line: "how much of x% drop in MoM page views can be
attributed to msg loss?"

And as we have server clusters on hot standby, being fed a trickle of data
(as I understood long ago, to keep caches up to date), their contribution to
overall loss would be minimal.
But seeing them in red could give early warning that we would have an
issue once they become primary servers.

Erik

*From:* analytics-bounces@lists.wikimedia.org [mailto:
analytics-bounces@lists.wikimedia.org] *On Behalf Of *Andrew Otto
*Sent:* Monday, July 29, 2013 5:03 PM
*To:* A mailing list for the Analytics Team at WMF and everybody who has
an interest in Wikipedia and analytics.
*Subject:* Re: [Analytics] Monthly data loss on udp2log quantified in new
report

This is awesome, thanks Erik!

In conjunction, there are similar ganglia metrics of these numbers on each
of the udp2log boxes. Example:

http://bit.ly/13pkK3e

You can see a similar breakdown of packet_loss_average per role.  These
roles are defined by the pybal config:  http://noc.wikimedia.org/pybal/

The packet_loss_average metric is sampled at a 1/10 level instead of
1/1000, so it will be slightly more accurate.  However, these metrics don't
weight anything, so if there is any loss from a role that has very few
requests, the average will be skewed.

Having both of these available for troubleshooting is very useful.

Thanks again!
-Ao

On Jul 29, 2013, at 10:39 AM, Erik Zachte ezachte@wikimedia.org wrote:

Hi all,

Over the years we've had several serious issues with huge underreporting
on page view data due to message loss on udp2log.

There are now several diagnostic tools: alerts are sent and there is
real-time monitoring http://tinyurl.com/kqmtfss
But none of those help to quantify total monthly loss.

I upgraded an existing csv file to an HTML report, to be updated monthly.
http://stats.wikimedia.org/wikimedia/squids/SquidDataMonthlyPerSquidSet.htm

This report shows total monthly message loss as a percentage, plus a
breakdown of message loss and traffic volume by server role and location.

Basic idea behind the report is that as we use 1:1000 sampling, for each
squid server we should find sequence numbers between logged messages to be
1000 apart, on average.
If we actually find they are 1050 apart that translates into 4.7% data
loss.
On how this is calculated see
http://stats.wikimedia.org/wikimedia/squids/SquidDataMonthlyPerSquidSet.htm#...
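
The gap-to-loss arithmetic above can be sketched in a few lines of Python.
This is a minimal illustration, not the report's actual code: the names
average_gap and loss_fraction are hypothetical, and it assumes loss is
estimated as 1 - expected_gap / observed_gap, which works out to about
4.76% for the 1050 example.

```python
from typing import Sequence


def average_gap(sequence_numbers: Sequence[int]) -> float:
    """Average distance between consecutive logged sequence numbers.

    Assumes the numbers come from one server, are in increasing order,
    and do not wrap around (all assumptions made for this sketch).
    """
    if len(sequence_numbers) < 2:
        raise ValueError("need at least two logged messages")
    span = sequence_numbers[-1] - sequence_numbers[0]
    return span / (len(sequence_numbers) - 1)


def loss_fraction(expected_gap: float, observed_gap: float) -> float:
    """Estimated fraction of messages lost.

    With 1:1000 sampling the expected gap is 1000. If a fraction p of
    messages is lost before sampling, the observed gap grows to
    expected_gap / (1 - p), so p = 1 - expected_gap / observed_gap.
    """
    if observed_gap <= 0:
        raise ValueError("observed gap must be positive")
    # Clamp at zero: a gap below the expected value is sampling noise,
    # not negative loss.
    return max(0.0, 1.0 - expected_gap / observed_gap)


if __name__ == "__main__":
    # Made-up sequence numbers that are 1050 apart on average.
    logged = [1000, 2050, 3100, 4150]
    gap = average_gap(logged)
    print(f"average gap: {gap:.0f}")
    print(f"estimated loss: {loss_fraction(1000, gap):.2%}")
```

Clamping at zero is a design choice for the sketch; a real report might
prefer to keep small negative values visible as a sign of noise.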

I use a weighted average for calculating total percentage data loss,
taking into account data volume per server cluster, and ignoring servers
where the sequence number mechanism is still broken (ssl servers).
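
To show why the weighting matters, here is a rough Python sketch of a
volume-weighted loss percentage next to a plain unweighted mean. The
ClusterStats type, the cluster names and all numbers are invented for
illustration; the report's real inputs and code may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ClusterStats:
    name: str            # hypothetical cluster / role label
    messages: int        # traffic volume for the month
    loss_pct: float      # estimated loss for this cluster, in percent
    seq_ok: bool = True  # False if sequence numbers are unreliable


def usable(clusters: Iterable[ClusterStats]) -> List[ClusterStats]:
    """Drop clusters whose sequence numbers cannot be trusted."""
    return [c for c in clusters if c.seq_ok]


def weighted_loss_pct(clusters: Iterable[ClusterStats]) -> float:
    """Loss percentage weighted by each cluster's message volume."""
    kept = usable(clusters)
    total = sum(c.messages for c in kept)
    if total == 0:
        return 0.0
    return sum(c.loss_pct * c.messages for c in kept) / total


def unweighted_loss_pct(clusters: Iterable[ClusterStats]) -> float:
    """Plain mean over clusters, ignoring volume (for comparison only)."""
    kept = usable(clusters)
    if not kept:
        return 0.0
    return sum(c.loss_pct for c in kept) / len(kept)


if __name__ == "__main__":
    # Invented numbers: one busy cluster, one hot standby fed a trickle
    # of traffic, and one ssl cluster with broken sequence numbers.
    data = [
        ClusterStats("text", 1_000_000, 1.0),
        ClusterStats("standby", 1_000, 50.0),
        ClusterStats("ssl", 200_000, 0.0, seq_ok=False),
    ]
    print(f"unweighted: {unweighted_loss_pct(data):.2f}%")
    print(f"weighted:   {weighted_loss_pct(data):.2f}%")
```

With these made-up inputs the standby cluster dominates the unweighted
mean but barely moves the weighted figure, which is the skew discussed
in this thread.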

Role and implementation of udp2log are in flux. But in any setup it would
be good to have such an overall assessment of loss.

Cheers,

Erik

Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics

