Hey all, quick update with a summary of where we stand and the plan of attack.
tl;dr: It's probably the known-to-be-busted nginx sequence numbers on the SSL boxes polluting the average. We should modify PacketLossLogtailer.py to drop their entries.
1. Ganglia Numbers
As the Ganglia numbers are our main operational metric for data-integrity/analytics system health, it's essential we understand how they're generated.
A rough outline:
- The PacketLossLogtailer.py script aggregates the packet-loss.log and reports to Ganglia
- packet-loss.log, in turn, is generated by bin/packet-loss from the udplog project. From Andrew's previous email, lines look like (SPOILER ALERT):
[2013-03-20T21:38:47] ssl1 lost: (62.44131 +/- 4.19919)%
[2013-03-20T21:38:47] ssl1001 lost: (66.71974 +/- 0.20411)%
[2013-03-20T21:38:47] ssl1002 lost: (66.78863 +/- 0.19653)%
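(For reference, here's a rough sketch of how a line like that could be parsed. The regex is a guess based on the sample above, not the actual PacketLossLogtailer.py code.)

import re

# Assumed format of a packet-loss.log line, inferred from the sample above;
# the real PacketLossLogtailer.py may parse these differently.
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\]\s+(?P<host>\S+)\s+lost:\s+'
    r'\((?P<loss>-?[\d.]+)\s+\+/-\s+(?P<margin>[\d.]+)\)%'
)

def parse_line(line):
    """Return (timestamp, host, loss_pct, margin_pct), or None if the line doesn't match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    return (m.group('ts'), m.group('host'),
            float(m.group('loss')), float(m.group('margin')))

print(parse_line('[2013-03-20T21:38:47] ssl1 lost: (62.44131 +/- 4.19919)%'))
# -> ('2013-03-20T21:38:47', 'ssl1', 62.44131, 4.19919)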
Andrew is digging into the source to ensure we understand how these error intervals are generated, especially given the logtailer drops entries with a high error margin.
We assume we won't find anything shocking in packet-loss.cpp, but we need to be sure we understand what these numbers mean.
2. Current Alerts
Erik Zachte performed an analysis of sequence number deltas via the sampled 1:1000 squid logs, finding that the SSL hosts had 60%+ loss; Andrew verified this using the packet-loss logs. It's a known bug that nginx generates busted sequence numbers, so this is probably the explanation. Andrew is following up.
There's a clear trend visible if you look at the loss numbers over the last month -- the 90th shows it most clearly:
- 90th Percentile: http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&title...
- Average: http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&title...
- Last week: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&title=...
As Oxygen does not receive logging from the SSL proxies, this seems almost certain to be the answer.
However, while filtering out entries in the packet-loss log from the SSL hosts might fix the alerts, it doesn't explain why we're seeing the increase in the first place. More troubling, there was a huge increase in Oct that we somehow did not notice:
- Average: http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&title=...
- 90th: http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&title=...
There are plenty of theories, with both load and filters likely contributing to the loss numbers. Further, each box runs a different set of filters (and the boxes themselves span a wide range of hardware), so their performance isn't directly comparable.
Andrew's on point looking into these questions, but the immediate next step is to ignore the SSL machines in the packet-loss log. Modifying PacketLossLogtailer.py (rather than the logging itself) lets us preserve the data for future analysis.
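To make the proposal concrete, here's a sketch of the sort of filter we'd add; the hostname pattern and the error-margin threshold are illustrative placeholders, not the script's real values.

import re

# Hypothetical filter for the proposed PacketLossLogtailer.py change: skip
# entries from the nginx SSL proxies (hostnames like ssl1, ssl1001) before
# they enter the packet_loss_average calculation.
SSL_HOST_RE = re.compile(r'^ssl\d+$')
MAX_ERROR_MARGIN = 20.0  # placeholder value, not the logtailer's real threshold

def keep_entry(host, loss_pct, margin_pct):
    """Return True if this host's measurement should count toward the average."""
    if SSL_HOST_RE.match(host):
        return False  # known-busted nginx sequence numbers
    if margin_pct > MAX_ERROR_MARGIN:
        return False  # the logtailer already drops high-error-margin entries
    return True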
-- David Schoonover dsc@wikimedia.org
On Wed, Mar 20, 2013 at 3:11 PM, Diederik van Liere dvanliere@wikimedia.org wrote:
I do think that the Nginx / SSL servers are skewing the packet loss numbers because https://rt.wikimedia.org/Ticket/Display.html?id=859 has never been resolved.
D
On Wed, Mar 20, 2013 at 6:02 PM, Andrew Otto otto@wikimedia.org wrote:
(Tim, I'm CCing you here, maybe you can help me interpret udp2log packet loss numbers?)
Hm, so, I've set up gadolinium as a 4th udp2log webrequest host. It's running some (but not all) of the same filters that locke is currently running. udp2log packet_loss_average has hovered at 4 or 5% the entire time it has been up.
http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20...
Is that normal? I don't think so. This leads me to believe that my previous diagnosis (increased traffic hitting some thresholds) was wrong.
I see high numbers for packet loss from the SSL machines in both locke and gadolinium's packet-loss.log. e.g.
[2013-03-20T21:38:47] ssl1 lost: (62.44131 +/- 4.19919)%
[2013-03-20T21:38:47] ssl1001 lost: (66.71974 +/- 0.20411)%
[2013-03-20T21:38:47] ssl1002 lost: (66.78863 +/- 0.19653)%
…
Also, some of the sq*.wikimedia.org hosts (I think these are the Tampa squids) have higher than normal packet loss numbers, whereas the eqiad and esams hosts are mostly close to 0%.
Could it be that a few machines here are skewing the packet loss average?
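To put rough numbers on that (the host count is made up; the SSL values are from the log excerpt above):

# Illustrative only: suppose a run reports near-zero loss for most hosts but
# ~65% for the three SSL proxies shown above. An unweighted mean over hosts
# is dominated by those few entries, regardless of how little traffic they
# actually carry.
normal_hosts = [0.1] * 40           # e.g. eqiad/esams caches near 0%
ssl_hosts = [62.4, 66.7, 66.8]      # from the packet-loss.log excerpt

all_hosts = normal_hosts + ssl_hosts
print(sum(all_hosts) / len(all_hosts))        # ~4.6 -- roughly the 4-5% we're seeing
print(sum(normal_hosts) / len(normal_hosts))  # ~0.1 with the SSL hosts excluded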
The fact that gadolinium is seeing packet loss with not much running on it indicates to me that I'm totally missing something here. Since the data on gadolinium isn't being consumed yet, and it's still duplicated on locke, I'm going to turn off pretty much all filters there and see what happens.
-Ao
On Mar 19, 2013, at 10:11 AM, Andrew Otto otto@wikimedia.org wrote:
(Jeff, I'm CCing you here because you're involved in locke stuff.)
- how well does the solution reduce/eliminate packet loss? (Erik Z's input useful here)
I could be totally wrong with my diagnosis of the problem (especially since there was packet loss on emery yesterday too), but I believe that if we do the proposed solutions (replace locke, reduce filters on oxygen) this should eliminate the regular packet loss for the time being. emery might need some help too. Perhaps we can keep locke around for a while longer and spread emery's filters out between locke and gadolinium.
- is the solution temporary or permanent?
If it solves the problem, this is as permanent as it gets. We can expect traffic to only increase, so eventually this problem could happen again. Also, people always want more data, so it's possible that there will be enough demand for new filters that our current hardware would reach capacity again. Eventually we'd like Kafka to replace much of udp2log's functionality. That is our actual permanent solution.
- how soon could we implement the solution?
I will work on gadolinium (locke replacement) today. Oxygen filter reduction is pending verification from Evan that it is ok to do so. He's not so sure.
- what kind of investment in time/resource would it take and from whom (likely)?
locke replacement shouldn't take more than a day. The production frontend cache servers all need config changes deployed to them; I'd need to get that reviewed, plus a helpful ops babysitter.
Jeff Green has special fundraising stuff on locke, so we'd need his help to migrate that off. Perhaps we should leave locke running and let Jeff Green keep it for Fundraising?
On Mar 18, 2013, at 9:31 PM, Kraig Parkinson kparkinson@wikimedia.org wrote:
Andrew/Diederik, we've captured a lot of info in this thread so far. Could you briefly articulate/summarize the set of options we have for solving this problem as well as help us understand the trade-offs?
I'm interested in seeing the differences in terms of the following:
1. how well does the solution reduce/eliminate packet loss? (Erik Z's input useful here)
2. is the solution temporary or permanent?
3. how soon could we implement the solution?
4. what kind of investment in time/resource would it take and from whom (likely)?
Kraig
On Mon, Mar 18, 2013 at 2:02 PM, Rob Lanphier robla@wikimedia.org wrote:
Hi Diederik,
I don't think it'd be responsible to let this go on for another 2-3 days. It's already arguably been going on for too long as it stands. If you all haven't been cautioning people to adjust their numbers based on known loss, and you haven't been doing that yourselves, you probably should let folks know.
At any rate, I'm probably not the customer here. In the team, Erik Zachte probably has the best perspective on what is acceptable and what isn't, so I'm cc'ing him.
Rob
On Mon, Mar 18, 2013 at 12:31 PM, Diederik van Liere dvanliere@wikimedia.org wrote:
Robla: when you say 'soonish' what exact timeframe do you have in mind for solving this?
D
On Mon, Mar 18, 2013 at 3:25 PM, Diederik van Liere dvanliere@wikimedia.org wrote:
Office wiki: https://office.wikimedia.org/wiki/Partner_IP_Ranges
udp-filter config: https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=templ...
D
On Mon, Mar 18, 2013 at 2:51 PM, David Schoonover dsc@wikimedia.org wrote:
For reference, here's the varnish config in operations/puppet.
https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=templ...
I don't have links to the office wiki list or the udp-filter config; Diederik, could you link them just to double-check?
-- David Schoonover dsc@wikimedia.org
On Mon, Mar 18, 2013 at 11:41 AM, Diederik van Liere dvanliere@wikimedia.org wrote:

>> It's not a big difference, but it seems like good reason to be a little wary about switching between the ip logic and the varnish logic. Thoughts?
>
> There is no fundamental difference between ip logic and varnish logic, varnish sets the X-CS header on the same ip-ranges as udp-filter does. I just checked the office wiki, the varnish config and udp-filter and for Orange Uganda everything looks good.
>
> D
Yup!
More data:
https://gist.github.com/ottomata/5215342
I took Ben's PacketLossLogtailer.py script and stripped out the Ganglia-specific stuff. PacketLossAverager.py does exactly the same thing Ben's did, except it reads from stdin and outputs to stdout.
I then took the current packet-loss.log file from locke, and split it into 1 file per run (about 1 file for every 10 minutes in the last 24 hours). I then ran PacketLossAverager.py on each file and output the results. I did this a few times, each time removing or only focusing on particular host groups (all hosts, only ssl hosts, only pmtpa squid hosts, etc.)
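For reference, here's a simplified sketch of that stdin-to-stdout shape, assuming the packet-loss.log line format above. It's not the actual PacketLossAverager.py (see the gist for that), and the host filtering here just stands in for however you slice the input into host groups (grep works too).

#!/usr/bin/env python
# Simplified sketch, not the actual PacketLossAverager.py from the gist.
# Reads packet-loss.log lines on stdin and prints the unweighted average
# loss, optionally restricted to a host group.
import re
import sys

LINE_RE = re.compile(r'\[[^\]]+\]\s+(\S+)\s+lost:\s+\((-?[\d.]+)\s+\+/-\s+([\d.]+)\)%')

def average_loss(lines, host_filter=None):
    losses = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        host, loss = m.group(1), float(m.group(2))
        if host_filter and not host_filter(host):
            continue
        losses.append(loss)
    return sum(losses) / len(losses) if losses else 0.0

if __name__ == '__main__':
    lines = sys.stdin.readlines()
    is_ssl = lambda h: h.startswith('ssl')
    print('all hosts: %.2f%%' % average_loss(lines))
    print('ssl only:  %.2f%%' % average_loss(lines, is_ssl))
    print('no ssl:    %.2f%%' % average_loss(lines, lambda h: not is_ssl(h)))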
The most interesting numbers are these: SSL hosts regularly report > 60% loss. https://gist.github.com/ottomata/5215342#file-locke_loss_ssl_hosts-tsv
pmtpa squids hover around 0%, but with a larger and variable margin of error. This is probably due to the fact that they don't send as much traffic as the other hosts. https://gist.github.com/ottomata/5215342#file-locke_loss_sq_hosts-tsv
If I remove just the SSL hosts from the calculation, average loss goes down to around 0%. If I remove both the SSL hosts and the Tampa squids, average loss is again around 0%, with an error margin of ~0.5%. https://gist.github.com/ottomata/5215342#file-locke_loss_no_sq_or_ssl_hosts-...
Erik Zachte took the sampled-1000 logs and computed the average distance between sequence numbers per month. I've loaded these into a Google Spreadsheet. http://bit.ly/Zg9fYS
Since this is sampled-1000, a healthy monthly average sequence distance is about 1000. There are only a few notable outliers.
Since November, the upload pmtpa squids (last column on the right) have an average distance of about 1200. That's 20% higher than it should be. More importantly, the SSL hosts in all clusters are way, way off.
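For reference, this is roughly how such a per-host average gap could be computed (a sketch with an assumed record format; Erik's actual method may differ):

from collections import defaultdict

# Sketch of the per-host average sequence gap on a 1:1000 sampled log,
# assuming each parsed line gives (host, seqnum) in log order. A healthy
# host should average ~1000; much larger gaps suggest loss, and nonsense
# values (as on the nginx SSL hosts) suggest broken sequence numbers.
def average_gaps(records):
    last = {}
    gap_sum = defaultdict(int)
    gap_count = defaultdict(int)
    for host, seq in records:
        if host in last and seq > last[host]:  # skip counter resets/wraps
            gap_sum[host] += seq - last[host]
            gap_count[host] += 1
        last[host] = seq
    return {h: gap_sum[h] / float(gap_count[h]) for h in gap_count}

print(average_gaps([('cp1001', 1000), ('cp1001', 2003), ('cp1001', 2998)]))
# -> {'cp1001': 999.0}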
nginx (SSL) udp2log sequence numbers have never worked, but we are including them in our packet loss calculations anyway.
Here are the relevant ganglia links again:
- Average Loss: http://bit.ly/10oAoxl
- 90th Percentile Loss: http://bit.ly/103dUiW
As David said, we still have to figure out what happened in October. Whatever happened, notice that oxygen's packet loss average didn't jump up as high as the others did in October. oxygen does not receive logs from the nginx SSL servers, so its numbers aren't as skewed.
If we really want reliable packet loss numbers, we should do two things.
1. Remove nginx from packet-loss calculation.
Since the nginx SSL logs are proxy duplicates of real logs anyway, and oxygen is already not receiving them, I think we should just stop including them in the webrequest udp2log stream altogether. If we need information about SSL requests, we can start logging the X-Forwarded-Proto header. Or we can make nginx log to a different port, and keep those requests as their own logs.
2. Make PacketLossLogtailer.py smarter. Currently, the numbers emitted to ganglia do not account for traffic proportion or margin of error. PacketLossLogtailer.py looks at packet-loss.log (generated by packet-loss.cpp in the udplog repo), sums the percent loss for each distinct hostname, and then averages those. We could make this smarter by using the margin of error or the relative traffic proportion per host to weight the host's percent loss in the overall percent loss average.
Or, could we use the median percent loss instead of the average?
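To make those two options concrete, here's a rough sketch. The weighting scheme is just one possibility, not a spec, and it assumes we have (loss_pct, margin_pct, traffic_share) per host; the fields actually available to PacketLossLogtailer.py may differ.

def weighted_loss_average(entries):
    """entries: list of (loss_pct, margin_pct, traffic_share) per host.
    Weight each host by its share of traffic, damped by its error margin,
    so low-volume or noisy hosts can't dominate the cluster-wide number.
    The damping factor is an arbitrary choice for illustration."""
    num = den = 0.0
    for loss, margin, share in entries:
        weight = share / (1.0 + margin)
        num += loss * weight
        den += weight
    return num / den if den else 0.0

def median_loss(entries):
    """Alternative: the median is robust to a handful of wildly-off hosts."""
    losses = sorted(loss for loss, _, _ in entries)
    if not losses:
        return 0.0
    mid = len(losses) // 2
    return losses[mid] if len(losses) % 2 else (losses[mid - 1] + losses[mid]) / 2.0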
(Also, I still need to read packet-loss.cpp to make sure we understand how the error margin and percent loss per host is calculated in the first place…)
-Ao
So here is the final breakdown of 'packet loss' and traffic volume per squid range
Squids are broken down by role (from CommonSettings.php) and location
Data based on 1:1000 sampled squid log files from Locke, 2010-04 till 2013-03
'packet loss': http://bit.ly/Y3Hqam (actually avg gap between sequence numbers)
hourly load: http://bit.ly/Y3Iifm (x 1000)
Clearly there is no capacity issue at Locke, as some squid sets are serviced perfectly.
Some extra observations:
- All ssl squids have corrupted sequence numbering (see also David's earlier mail)
- text squids have < 0.1% loss in recent months
- upload squids series knsq* (knams) have 2-3% loss in Feb/Mar; upload squids series cp* (eqiad) and amssq* (esams) are fine (upload squids in pmtpa have 23% loss since Nov 2012, but their load has dropped to almost zero so that doesn't weigh in on Ganglia trends)
- API squids pmtpa: avg gap is well below 1000 since May 2012 (how can that be?)
Erik
Hi Erik,
On Mar 22, 2013, at 3:02 PM, Erik Zachte ezachte@wikimedia.org wrote:
text squids have < 0.1% loss in recent months
upload squids series knsq* (knams) have 2-3% loss in Feb/Mar,
My best explanation for that is that these old Squids are on an old, different switch. We're getting rid of both within the next few weeks.
upload squids series cp* (eqiad) and amssq* (esams) are fine (upload squids in pmtpa have 23% loss since Nov 2012 but their load has dropped to almost zero so that doesn't weigh in on Ganglia trends)
- API squids pmtpa: avg gap is well below 1000 since May 2012 (how can that be?)
Yeah, all pmtpa caches have very little load on them now, so 1:1000 sampled log statistics won't be very accurate for those.