Hi all,
Over the years we've had several serious cases of underreported page view data due to message loss on udp2log.
There are now several diagnostic tools: alerts are sent, and there is real-time monitoring (http://tinyurl.com/kqmtfss).
But none of those helps quantify the total monthly loss.
I upgraded an existing CSV file to an HTML report, which will be updated monthly:
http://stats.wikimedia.org/wikimedia/squids/SquidDataMonthlyPerSquidSet.htm
This report shows total monthly message loss as a percentage, plus a breakdown of message loss and traffic volume by server role and location.
The basic idea behind the report: since we use 1:1000 sampling, for each squid server the sequence numbers of consecutive logged messages should be 1000 apart, on average.
If we actually find they are 1050 apart, that translates into roughly 4.7% data loss.
On how this is calculated, see http://stats.wikimedia.org/wikimedia/squids/SquidDataMonthlyPerSquidSet.htm#calc
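For illustration, here is a minimal sketch of that gap-based estimate, assuming the loss is derived from the mean sequence-number gap as described above (the function name and figures are illustrative, not taken from the actual report scripts):

    # Sketch of the gap-based loss estimate, assuming 1:1000 sampling:
    # if consecutive sampled sequence numbers are on average `mean_gap` apart,
    # the fraction of messages that never reached udp2log is
    # (mean_gap - 1000) / mean_gap.

    SAMPLING_INTERVAL = 1000  # 1:1000 sampling of squid log lines

    def estimated_loss(mean_gap: float, interval: int = SAMPLING_INTERVAL) -> float:
        """Fraction of messages lost, inferred from the mean sequence-number gap."""
        if mean_gap <= interval:
            return 0.0
        return (mean_gap - interval) / mean_gap

    # Example from above: a mean gap of 1050 instead of 1000.
    print(f"{estimated_loss(1050):.2%}")  # -> 4.76%, i.e. the ~4.7% quoted above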
I use a weighted average to calculate the total percentage of data loss, taking into account the data volume per server cluster, and ignoring servers where the sequence number mechanism is still broken (the SSL servers).
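The weighting looks roughly like this (the cluster names, volumes, and loss figures below are invented, purely to show the mechanics):

    # Sketch of the weighted overall loss: each cluster's loss percentage is
    # weighted by its share of the total message volume, and clusters whose
    # sequence numbers are known to be broken (e.g. the ssl servers) are skipped.

    clusters = [
        # (name, monthly message volume, measured loss fraction, sequence numbers usable?)
        ("text esams",   9_000_000, 0.012, True),
        ("upload eqiad", 4_000_000, 0.030, True),
        ("ssl",            500_000, 0.950, False),  # broken sequence numbers -> ignored
    ]

    usable = [(volume, loss) for _, volume, loss, ok in clusters if ok]
    total_volume = sum(volume for volume, _ in usable)
    overall_loss = sum(volume * loss for volume, loss in usable) / total_volume

    print(f"overall loss: {overall_loss:.1%}")  # -> 1.8% with these made-up numbers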
The role and implementation of udp2log are in flux, but in any setup it would be good to have such an overall assessment of loss.
Cheers,
Erik
This is awesome, thanks Erik!
In addition, there are similar Ganglia metrics for these numbers on each of the udp2log boxes. Example:
You can see a similar breakdown of packet_loss_average per role. These roles are defined by the pybal config: http://noc.wikimedia.org/pybal/
The packet_loss_average metric is sampled at a 1/10 level instead of 1/1000, so it will be slightly more accurate. However, these metrics don't weight anything, so if there is any loss from a role that has very few requests, the average will be skewed.
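To make that skew concrete, here is a toy comparison (role names and numbers are invented) of an unweighted per-role average versus one weighted by request volume:

    # Made-up per-role figures to show the skew: a plain average over roles lets a
    # tiny, lossy role dominate, while weighting by request volume does not.
    roles = {
        "text":    {"requests": 10_000_000, "loss": 0.01},
        "upload":  {"requests":  5_000_000, "loss": 0.02},
        "standby": {"requests":      1_000, "loss": 0.80},  # trickle of traffic
    }

    unweighted = sum(r["loss"] for r in roles.values()) / len(roles)
    weighted = (sum(r["requests"] * r["loss"] for r in roles.values())
                / sum(r["requests"] for r in roles.values()))

    print(f"unweighted mean loss: {unweighted:.1%}")  # -> 27.7%, dominated by the tiny role
    print(f"weighted mean loss:   {weighted:.1%}")    # -> 1.3%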
Having both of these available for troubleshooting is very useful.
Thanks again! -Ao
Thanks Andrew,
"However, these metrics don't weight anything, so if there is any loss from a role that has very few requests, the average will be skewed."
Yes, that is precisely what I thought was missing. So what my report adds is the bottom line: "how much of an x% drop in month-over-month page views can be attributed to message loss?"
And as we have server clusters on hot standby, which are fed only a trickle of data (as I understood long ago, to keep their caches up to date), their contribution to overall loss would be minimal.
But seeing them in red could give an early warning that we would have an issue if they were to become the primary servers.
Erik
Thanks for putting this together, Erik.