Thanks for putting this together, Erik.
On Mon, Jul 29, 2013 at 8:29 AM, Erik Zachte <ezachte(a)wikimedia.org> wrote:
Thanks Andrew,

However, these metrics don't weight anything, so if there is any loss
from a role that has very few requests, the average will be skewed.

Yes, that is precisely what I thought was missing. So what my report adds
is the bottom line: "how much of x% drop in MoM page views can be
attributed to msg loss?"

And as we have server clusters on hot standby, being fed a trickle of data
(as I understood long ago, to keep caches up to date), their contribution to
overall loss would be minimal.
But seeing them in red could give early warning that we would have an
issue when they become the primary servers.

Erik

*From:* analytics-bounces(a)lists.wikimedia.org [mailto:analytics-bounces(a)lists.wikimedia.org] *On Behalf Of* Andrew Otto
*Sent:* Monday, July 29, 2013 5:03 PM
*To:* A mailing list for the Analytics Team at WMF and everybody who has
an interest in Wikipedia and analytics.
*Subject:* Re: [Analytics] Monthly data loss on udp2log quantified in new report

This is awesome, thanks Erik!

In conjunction, there are similar ganglia metrics of these numbers on each
of the udp2log boxes. Example:

http://bit.ly/13pkK3e

You can see a similar breakdown of packet_loss_average per role. These
roles are defined by the pybal config:
http://noc.wikimedia.org/pybal/

The packet_loss_average metric is sampled at a 1/10 level instead of
1/1000, so it will be slightly more accurate. However, these metrics don't
weight anything, so if there is any loss from a role that has very few
requests, the average will be skewed.

Having both of these available for troubleshooting is very useful.

Thanks again!
-Ao

On Jul 29, 2013, at 10:39 AM, Erik Zachte <ezachte(a)wikimedia.org> wrote:
Hi all,

Over the years we've had several serious issues with huge underreporting
on page view data due to message loss on udp2log.

There are now several diagnostic tools: alerts are sent and there is
real-time monitoring: http://tinyurl.com/kqmtfss
But none of those help to quantify total monthly loss.

I upgraded an existing csv file to an html report, to be updated monthly:
http://stats.wikimedia.org/wikimedia/squids/SquidDataMonthlyPerSquidSet.htm

This report shows total monthly message loss as a percentage, plus a
breakdown of message loss and traffic volume by server role and location.

The basic idea behind the report is that, as we use 1:1000 sampling, for each
squid server we should find sequence numbers between logged messages to be
1000 apart, on average.
If we actually find they are 1050 apart, that translates into roughly 4.8%
data loss (50 out of every 1050 messages, i.e. 1 - 1000/1050).
On how this is calculated see
http://stats.wikimedia.org/wikimedia/squids/SquidDataMonthlyPerSquidSet.htm…
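
To make the arithmetic above concrete, here is a minimal sketch in Python. It is not the code behind the report; the function name, the input format (an ordered list of sequence numbers from the sampled log lines of one server) and the decision to ignore counter wrap-around are all assumptions made for illustration.

```python
def estimated_loss(sequence_numbers, sampling_rate=1000):
    """Estimate the fraction of messages lost for one server.

    sequence_numbers: sequence numbers found on the sampled log lines,
        in the order they were logged (hypothetical input format).
    sampling_rate: 1-in-N sampling, so without loss consecutive sampled
        lines should be N sequence numbers apart on average.
    Assumption: the sequence counter does not wrap around in the input.
    """
    if len(sequence_numbers) < 2:
        raise ValueError("need at least two sampled sequence numbers")
    gaps = [b - a for a, b in zip(sequence_numbers, sequence_numbers[1:])]
    average_gap = sum(gaps) / len(gaps)
    # Per sampled line the sender counted average_gap messages, but only
    # sampling_rate of them arrived; the rest were lost on the way.
    loss = 1 - sampling_rate / average_gap
    # A gap below the sampling rate would give a negative value; clamp it.
    return max(loss, 0.0)


example = [0, 1050, 2100, 3150]  # made-up numbers, 1050 apart
print(f"{estimated_loss(example):.2%}")  # prints 4.76%
```

With gaps of exactly 1000 the function returns 0.0, which matches the "no loss" case described above.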

I use a weighted average for calculating total percentage data loss,
taking into account data volume per server cluster, and ignoring servers
where the sequence number mechanism is still broken (ssl servers).
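
As an illustration of why the weighting matters, here is a small Python sketch. The cluster names, volumes and loss figures are invented for the example, and representing a cluster with broken sequence numbers as `None` is an assumption, not how the report stores its data.

```python
def weighted_loss(clusters):
    """Volume-weighted average loss over server clusters.

    clusters: list of (name, volume, loss) tuples, where volume is the
        number of messages seen and loss is a fraction between 0 and 1,
        or None when sequence numbers cannot be trusted (hypothetical
        convention for this sketch).
    """
    usable = [(volume, loss) for _, volume, loss in clusters if loss is not None]
    total_volume = sum(volume for volume, _ in usable)
    if total_volume == 0:
        raise ValueError("no cluster with usable loss data")
    return sum(volume * loss for volume, loss in usable) / total_volume


# Made-up numbers: one busy cluster, one hot-standby cluster, one ssl cluster.
clusters = [
    ("busy", 9000, 0.01),
    ("standby", 10, 0.50),
    ("ssl", 500, None),  # ignored: sequence numbers unreliable
]

losses = [loss for _, _, loss in clusters if loss is not None]
unweighted = sum(losses) / len(losses)

print(f"weighted:   {weighted_loss(clusters):.2%}")  # prints weighted:   1.05%
print(f"unweighted: {unweighted:.2%}")               # prints unweighted: 25.50%
```

In this made-up case the standby cluster barely moves the weighted total, while a plain average of the per-cluster figures is dominated by it, which is the skew mentioned earlier in the thread.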

The role and implementation of udp2log are in flux. But in any setup it would
be good to have such an overall assessment of loss.

Cheers,

Erik
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics