Thanks for putting this together, Erik.
On Mon, Jul 29, 2013 at 8:29 AM, Erik Zachte ezachte@wikimedia.org wrote:
Thanks Andrew,
"However, these metrics don't weight anything, so if there is any loss
from a role that has very few requests, the average will be skewed."
Yes, that is precisely what I thought was missing. So what my report adds is the bottom line: "how much of x% drop in MoM page views can be attributed to msg loss?"
And as we have server clusters on hot standby, being fed a trickle of data (as I understood long ago, to keep caches up to date), their contribution to overall loss would be minimal.
But seeing them in red could give early warning that we would have an issue when they become primary servers.
Erik
From: analytics-bounces@lists.wikimedia.org [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Andrew Otto
Sent: Monday, July 29, 2013 5:03 PM
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] Monthly data loss on udp2log quantified in new report
This is awesome, thanks Erik!
In conjunction, there are similar ganglia metrics of these numbers on each of the udp2log boxes. Example:
You can see a similar breakdown of packet_loss_average per role. These roles are defined by the pybal config: http://noc.wikimedia.org/pybal/
The packet_loss_average metric is sampled at a 1/10 level instead of 1/1000, so it will be slightly more accurate. However, these metrics don't weight anything, so if there is any loss from a role that has very few requests, the average will be skewed.
Having both of these available for troubleshooting is very useful.
Thanks again!
-Ao
On Jul 29, 2013, at 10:39 AM, Erik Zachte ezachte@wikimedia.org wrote:
Hi all,
Over the years we've had several serious issues with huge underreporting on page view data due to message loss on udp2log.
There are now several diagnostic tools: alerts are sent and there is real-time monitoring (http://tinyurl.com/kqmtfss).
But none of those help to quantify total monthly loss.
I upgraded an existing CSV file to an HTML report, to be updated monthly.
http://stats.wikimedia.org/wikimedia/squids/SquidDataMonthlyPerSquidSet.htm
This report shows total monthly message loss as a percentage, plus a breakdown of message loss and traffic volume by server role and location.
The basic idea behind the report is that, as we use 1:1000 sampling, the sequence numbers of consecutive logged messages from each squid server should be 1000 apart, on average.
If we actually find they are 1050 apart, that translates into roughly 4.8% data loss (1 - 1000/1050).
On how this is calculated see http://stats.wikimedia.org/wikimedia/squids/SquidDataMonthlyPerSquidSet.htm#...
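The sequence-number arithmetic can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the code behind the report: the function names are hypothetical, and it assumes sequence numbers arrive in order without wrapping around. With an expected gap of 1000 and an observed gap of 1050, the estimate is 1 - 1000/1050, or approximately 4.76%.

```python
def mean_gap(sequence_numbers):
    """Average distance between consecutive logged sequence numbers.

    Assumes the numbers are sorted and never wrap around (an
    illustrative simplification, not a property of udp2log).
    """
    if len(sequence_numbers) < 2:
        raise ValueError("need at least two sequence numbers")
    span = sequence_numbers[-1] - sequence_numbers[0]
    return span / (len(sequence_numbers) - 1)


def estimated_loss(observed_gap, expected_gap=1000):
    """Fraction of messages lost, given the observed mean gap.

    With 1:1000 sampling the expected gap is 1000; a larger observed
    gap means some sampled messages never reached the log.
    """
    if observed_gap <= 0:
        raise ValueError("observed_gap must be positive")
    return 1 - expected_gap / observed_gap


# Three logged messages whose sequence numbers are 1050 apart.
gap = mean_gap([0, 1050, 2100])
loss_percent = round(100 * estimated_loss(gap), 2)
print(loss_percent)
```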
I use a weighted average for calculating total percentage data loss, taking into account data volume per server cluster, and ignoring servers where the sequence number mechanism is still broken (ssl servers).
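To illustrate why the weighting matters, here is a minimal Python sketch. Everything in it is an assumption made for the example: the ClusterStats type, the role names, the traffic volumes and the loss fractions are invented, and the report may weight by a different volume measure. It compares a plain average over clusters with an average weighted by message volume, skipping clusters whose sequence numbers cannot be trusted.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    role: str             # hypothetical role label
    messages: int         # monthly traffic volume for this cluster
    loss: float           # estimated loss fraction, 0.0 to 1.0
    seq_ok: bool = True   # False where sequence numbers are unreliable


def unweighted_loss(clusters):
    """Plain mean over usable clusters; low-traffic roles can skew it."""
    usable = [c for c in clusters if c.seq_ok]
    return sum(c.loss for c in usable) / len(usable)


def weighted_loss(clusters):
    """Mean loss weighted by message volume, ignoring unusable clusters."""
    usable = [c for c in clusters if c.seq_ok]
    total = sum(c.messages for c in usable)
    if total == 0:
        raise ValueError("no usable traffic to weight by")
    return sum(c.loss * c.messages for c in usable) / total


# Invented numbers: one busy cluster, one hot-standby cluster that
# sees a trickle of traffic, and one cluster with broken sequencing.
example = [
    ClusterStats("text", 900_000, 0.01),
    ClusterStats("standby", 1_000, 0.50),
    ClusterStats("ssl", 50_000, 0.0, seq_ok=False),
]

print(unweighted_loss(example))
print(weighted_loss(example))
```

In this made-up case the plain average is 25.5%, dominated by the standby cluster, while the volume-weighted figure stays close to 1%, which is the distinction between the per-role metrics and the overall monthly figure discussed in this thread.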
The role and implementation of udp2log are in flux, but in any setup it would be good to have such an overall assessment of loss.
Cheers,
Erik
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics