So, we've had conversations about detecting SSL terminators, for two reasons:
1. It would allow us to know when, particularly, we should trust x_forwarded_for fields for geolocation;
2. More importantly, it would allow us to reliably exclude traffic from internal IP ranges without excluding SSL traffic.
Aaron talked to Ops about this problem (notes at http://etherpad.wikimedia.org/p/ssl_terminators) - in conversation with Ori, though, I found out that this approach won't actually work, because caches != SSL terminators, all the time.
So: what's the right approach? How do we find these things easily and automagically?
Hi Oliver,
On Wed, Dec 10, 2014 at 08:22:18PM -0500, Oliver Keyes wrote:
So, we've had conversations about detecting SSL terminators, for two reasons: [...] So: what's the right approach? How do we find these things easily and automagically.
The “right” approach depends a bit on the stream that you're looking at. But I figure you're mostly interested in Hive data (for different streams, there are other methods).
More or less the same question got asked on the internal list on Sunday. There I pointed towards pybal:
On Sun, Dec 07, 2014 at 12:59:27PM +0100, Christian Aistleitner wrote:
Hi,
On Fri, Dec 05, 2014 at 03:23:45PM -0600, Aaron Halfaker wrote:
And wrote up some brief notes in http://etherpad.wikimedia.org/p/ssl_terminators
In that etherpad you wrote:
Etherpad> * Scan through: https://github.com/wikimedia/operations-puppet/blob/production/manifests/sit...
Etherpad> * Look for anything with role::cache::*
[...]
If you want even less puppet munging, and a more robust format, you can instead go to pybal directly:
http://config-master.wikimedia.org/pybal/
For example:
http://config-master.wikimedia.org/pybal/esams/text-https
I think that still holds true.
Does that approach not work, or are you just trying to get the response to the public list? ;-)
If it's the former, please let me know where you think this approach is failing.
If it's the latter ... yay for using the public list! ... here you go. It's on the public list :-D
Have fun, Christian
On 11 December 2014 at 11:52, Christian Aistleitner < christian@quelltextlich.at> wrote:
Does that approach not work, or are you just trying to get the response to the public list? ;-)
If it's the former, please let me know where you think this approach is failing.
If it's the latter ... yay for using the public list! ... here you go. It's on the public list :-D
"yes" :D. I want to make these conversations public, and for us to bias more towards using the public list - but there was also a point of confusion on how we detected these machines, using puppet. If pybal clarifies it, yay!
I'm not sure how to interpret the pybal, but that's probably because my explanation of the problem was tremendously unclear. Essentially; we want to be excluding internal IP spaces, because that contains a lot of automatically-generated traffic (fundraising, I'm looking at you). So, we exclude all requests from IPs within our ranges. Except, then we also exclude all the SSL traffic, since that will appear to come from an internal IP address, from the point of view of the request logs.
So, do I interpret this pybal as: if it's tagged as HTTPS, it's an SSL terminator, and so requests from those machines, from internal IP addresses, should be included? Or: those are the SSL machines, find out their IP addresses and you find out the internal IPs that represent SSLd requests, rather than internally-generated traffic?
Have fun, Christian
-- ---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ---- Companies' registry: 360296y in Linz Christian Aistleitner Kefermarkterstrasze 6a/3 Email: christian@quelltextlich.at 4293 Gutau, Austria Phone: +43 7946 / 20 5 81 Fax: +43 7946 / 20 5 81 Homepage: http://quelltextlich.at/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
There must be some way to tag traffic as https or not at the nginx or varnish level, no? Has anyone looked into this?
There definitely is - it's done for mobile, for example - and Christian and I discussed it when I was experimenting with the sampled logs - but I can't find the thread right now. Bah :/
-- Oliver Keyes Research Analyst Wikimedia Foundation _______________________________________________ Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Hi Andrew,
On Fri, Dec 12, 2014 at 09:41:11AM -0500, Andrew Otto wrote:
There must be some way to tag traffic as https or not at the nginx or varnish level, no? Has anyone looked into this?
Yes. On the mobile caches, varnish adds a https=1 tag to the X-Analytics field [1].
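For readers who want to see what consuming such a tag could look like, here is a minimal sketch. It assumes the X-Analytics value is a list of key=value pairs separated by semicolons and that "-" stands for an empty field; both the separator handling and the sample values are assumptions for illustration, not taken from this thread.

```python
def parse_x_analytics(field):
    """Split an X-Analytics value into a dict of key -> value strings.

    Assumption: pairs look like 'key=value' and are separated by ';'.
    """
    pairs = {}
    for item in field.split(";"):
        item = item.strip()
        if not item or item == "-":
            # Skip empty fragments and the placeholder for "no value".
            continue
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def is_https(field):
    """True if the field carries the https=1 tag."""
    return parse_x_analytics(field).get("https") == "1"


# Hypothetical sample values, for illustration only.
print(is_https("zero=250-99;https=1"))
print(is_https("-"))
```

A check like this only works for log lines that were tagged correctly when they were written, so it does not help with re-processing mistagged data.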
But as nice and easy as Varnish tagging looks from the outside, it has burned us many times in many different ways around Wikipedia Zero. The fact that we cannot run written logs through VCL logic again is a deal breaker.
So assume we extend the above https=1 Varnish tagging to bits, text, and upload too. Then we build analytics machinery relying on those tags. That is nice and shiny until varnish tagging breaks for the first time (and it will break for sure). Typically, we won't notice immediately, but only some time afterwards. Say two days after it happened. How would we re-process the data for those two days?
I do not know of a way to automatically pass our written logs through the VCL tagging machinery again. Hence, (to make up for the mistagging of those two days) we'd have to re-implement the Varnish logic in the cluster and re-tag all log lines somewhere on the cluster.
So at the end of the day, we:
* Have implemented https tagging logic in Varnish.
* Have implemented https tagging logic in the cluster.
* Maybe have to keep those two implementations in sync.
* Are scared of Varnish's https tagging breaking again (at least I would be).
We can remove 3 of those 4 items, if we implement https tagging in the cluster right away. We cannot escape it, if we want good data. And it removes so much pressure.
Have fun, Christian
P.S.: How we implement https tagging in the cluster is up for discussion.
Detecting IPs has good (not perfect) quality and is pretty robust against misconfigurations on the pipeline. We can do that as of today.
An alternative might be to start tracking X-Forwarded-Proto, which would be way simpler than the IP approach. It has good quality too and is way more robust than X-Analytics. But that would need more research, and would again require us to add a column to the logging format (which last time made the table explode).
[1] See row “https” in https://wikitech.wikimedia.org/wiki/X-Analytics
Hi,
On Thu, Dec 11, 2014 at 06:27:02PM -0500, Oliver Keyes wrote:
On Sun, Dec 07, 2014 at 12:59:27PM +0100, Christian Aistleitner wrote:
[...] I'm not sure how to interpret the pybal,
The exemplary file linked above holds lines like
{ 'host': 'amssq36.esams.wmnet', 'weight': 1, 'enabled': True }
Such a line means:
The host 'amssq36.esams.wmnet' [1] is [2] an SSL terminator for text cluster in esams [3], and has weight 1 [4].
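A minimal sketch of reading such a file follows. It assumes each entry is a Python dict literal on its own line and that disabled entries are commented out with a leading '#'. The function name and the second and third sample hosts are hypothetical; only the first sample line comes from the file discussed here.

```python
import ast


def enabled_hosts(pybal_text):
    """Return host names that are enabled with positive weight.

    Assumptions: one dict literal per line; lines starting with '#'
    are commented out and must be ignored.
    """
    hosts = set()
    for line in pybal_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        try:
            # literal_eval only accepts literals, so nothing is executed.
            entry = ast.literal_eval(line.rstrip(","))
        except (ValueError, SyntaxError):
            continue  # not a dict literal; skip rather than guess
        if not isinstance(entry, dict) or "host" not in entry:
            continue
        if entry.get("enabled") and entry.get("weight", 0) > 0:
            hosts.add(entry["host"])
    return hosts


SAMPLE = """\
{ 'host': 'amssq36.esams.wmnet', 'weight': 1, 'enabled': True }
#{ 'host': 'commented-out.example.wmnet', 'weight': 1, 'enabled': True }
{ 'host': 'disabled.example.wmnet', 'weight': 1, 'enabled': False }
"""

print(enabled_hosts(SAMPLE))
```

The result is a set of host names. To compare against the ip column of the request logs, these names would still need to be resolved to addresses, and the set is only valid for the time at which the file was read.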
Essentially; we want to be excluding internal IP spaces, because that contains a lot of automatically-generated traffic (fundraising, I'm looking at you)
Oliver, I do not like blame games. You have blamed Fundraising before for causing lots of internal requests, and I asked you then to please provide an example. However, you did not provide one. And yet you call out Fundraising again.
Please provide an example [5] of such traffic, so we're all on the same page.
So, we exclude all requests from IPs within our ranges. Except, then we also exclude all the SSL traffic, since that will appear to come from an internal IP address, from the point of view of the request logs.
So, do I interpret this pybal as: if it's tagged as HTTPS,
Since you use 'tag' in different contexts around https, let me clarify how I read 'tag' here. I read it as “If a pybal *-https file lists a host as enabled with positive weight in a line that is not commented out”.
it's an SSL terminator, [...]
Yes.
[...] and so requests from those machines, from internal IP addresses, should be included?
In the end “should be included” is something you have to decide.
But if you see a request whose ip column holds the address of a machine whose name was listed in a pybal *-https file while the request was processed, it “typically” is a relayed request from the SSL terminator.
(Note the distinction between my “typically is a relayed request from the SSL terminator” and your “should be included”.)
Or: those are the SSL machines, find out their IP addresses and you find out the internal IPs that represent SSLd requests, rather than internally-generated traffic?
I cannot fully parse that sentence. But it sounds a bit like SSL traffic would not be internally-generated traffic. From the logging perspective, SSL traffic is internally-generated traffic:
The SSL terminator performs a separate, genuinely fresh and new request to the caches.
This separate, genuinely fresh and new request gets logged. And that's the log line you're after, if you want to look at https traffic from within Hive.
Have fun, Christian
[1] 'host' field
[2] 'enabled' field
[3] see URL
[4] 'weight' field. You probably need not care about the weight. The weight tells you how much of the overall traffic a node gets. In the given file, all hosts have weight 1, so they all get a similar sized part of the overall traffic.
[5] Either anonymized on-list, or else for example through a command that we can run on stat1002.
On 15 December 2014 at 07:35, Christian Aistleitner < christian@quelltextlich.at> wrote:
Please provide an example [5] of such traffic, so we're all on the same page.
It's hard to pull out, but they're requests with a PhantomJS user-agent that hit a large number of places to test banners. To be clear (my initial email was not clear) this is not a serious "damn you fundraising! Damn you all to heck!" but a joking one ;p. They do fantastic work and the requests they make to test banner appearance is part of that work. FWIW, I was informed that they were doing this...by Fundraising ;p. If you'd like more confirmation than that, I can talk to them and grab a specific example.
But if you see a request whose ip column holds the address of a machine whose name was listed in a pybal *-https file while the request was processed, it “typically” is a relayed request from the SSL terminator.
(Note the distinction between my “typically is a relayed request from the SSL terminator” and your “should be included”.)
Awesome :). We'll never get certainty - getting "most of the time" is, I think, Good Enough (tm).
Or: those are the SSL machines, find out their IP addresses and you find out the internal IPs that represent SSLd requests, rather than internally-generated traffic?
I cannot fully parse that sentence. But it sounds a bit like SSL traffic would not be internally-generated traffic. From the logging perspective, SSL traffic is internally-generated traffic:
The SSL terminator performs a separate, genuinely fresh and new request to the caches.
This separate, genuinely fresh and new request gets logged. And that's the log line you're after, if you want to look at https traffic from within Hive.
Gotcha. So, if we wanted to exclude internally-generated traffic most of the time, without unduly punishing HTTPs traffic, we'd be looking at a heuristic that looks something like:
* If the request comes from a WMF IP range;
** Exclude, unless;
*** The request is to a host listed as https=1 in the pybal file
If I'm reading right?
Hi Oliver,
On Mon, Dec 15, 2014 at 07:40:47AM -0500, Oliver Keyes wrote:
On 15 December 2014 at 07:35, Christian Aistleitner < christian@quelltextlich.at> wrote:
On Thu, Dec 11, 2014 at 06:27:02PM -0500, Oliver Keyes wrote:
[ Fundraising is doing many internal requests ]
Please provide an example [5] of such traffic, so we're all on the same page.
It's hard to pull out, but they're requests with a PhantomJS user-agent that hit a large number of places to test banners.
Please let's stop being vague, and share your knowledge with us. Show real examples of what you are referring to.
If I look at /a/squid/archive/sampled/sampled-1000.tsv.log-20141215.gz on stat1002, I find 7689325 log lines [1]. Of those log lines, 354335 are from internal IPs [2]. Of those log lines from internal IPs, only 95 came with a PhantomJS User-Agent [3]. Even if all of those 95 are from Fundraising, that's only 0.027% of internal traffic, and 0.0012% of overall traffic.
That's less than I'd consider as:
a lot of automatically-generated traffic
So obviously I am getting it wrong. Please, help me understand what you are referring to. Please, provide real examples.
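The percentages quoted above can be rechecked from the three counts with a few lines of Python. The variable names are only illustrative; the counts are the ones given in footnotes [1] to [3].

```python
# Counts from the sampled log file (footnotes [1], [2], [3]).
total_lines = 7689325
internal_lines = 354335
phantomjs_internal_lines = 95

# Share of PhantomJS requests among internal and among all log lines.
share_of_internal = 100 * phantomjs_internal_lines / internal_lines
share_of_total = 100 * phantomjs_internal_lines / total_lines

print(f"{share_of_internal:.3f}% of internal traffic")  # 0.027%
print(f"{share_of_total:.4f}% of overall traffic")      # 0.0012%
```

Because these are ratios within one sampled file, the 1:1000 sampling does not change them.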
FWIW, I was informed that they were doing this...by Fundraising ;p. If you'd like more confirmation than that, I can talk to them and grab a specific example.
That's fine and actually that's pretty great that Fundraising informed you about this. Kudos to Fundraising!
However, doing analysis based on word-of-mouth is bound to fail. We can only analyze what's in our logs.
But it might be that our logs are wrong. If Fundraising is really making a large number of internal requests (how many?), we're maybe not logging those requests properly, because they don't show up as significant in our logs.
So, if we wanted to exclude internally-generated traffic most of the time, without unduly punishing HTTPs traffic, we'd be looking at a heuristic that looks something like:
* If the request comes from a WMF IP range;
** Exclude, unless;
*** The request is to a host listed as https=1 in the pybal file
(Nit-picking on the last step in [4])
The heuristic does not address two settings:
* It would count HTTPS traffic from internal nodes
* It would throw away labs' HTTP traffic
They are both low in volume, so I am not sure whether you want to care about them. But as they've been mentioned by others at some point in previous discussions, I am calling them out nonetheless.
* It would count HTTPS traffic from internal nodes
If for example I request
https://en.wikipedia.org/wiki/Foo
on stat1002, the IP in the logs would be the SSL terminator, so it would not get excluded, although the request originated from an internal machine. (X-Forwarded-For FTW!)
* It would throw away labs' HTTP traffic
When requesting
http://en.wikipedia.org/wiki/Foo
from a labs instance, the request is from an internal IP and the IP is not an SSL terminator. So the request would get thrown away.
But at some point you said that labs traffic should not get discarded immediately.
Have fun, Christian
[1] On stat1002:
zcat /a/squid/archive/sampled/sampled-1000.tsv.log-20141215.gz | wc -l
7689325
[2] On stat1002:
zcat /a/squid/archive/sampled/sampled-1000.tsv.log-20141215.gz | cut -f 5 | grep -cE '^(91.198.174.|208.80.15[2345].|198.35.2[67].|185.15.5[6789].|10.0.0.|2620:0:86[0123]:|2a02:ec80:)'
354335
[3] On stat1002:
zcat /a/squid/archive/sampled/sampled-1000.tsv.log-20141215.gz | cut -f 5,14 | grep -cE '^(91.198.174.|208.80.15[2345].|198.35.2[67].|185.15.5[6789].|10.0.0.|2620:0:86[0123]:|2a02:ec80:).*PhantomJS'
95
[4] Assuming the third line was meant to read something along the lines of
***The request is from a host listed (with positive weight, enabled, and not commented out) in a *https file in pybal while the request got processed
as there is no https=1 in pybal files. And requests /to/ SSL terminators do not make it into Hive. Only requests /from/ the SSL terminators do.
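The corrected heuristic could be sketched as follows. This is only an illustration under stated assumptions: the CIDR ranges are my translation of the prefix patterns in footnote [2], the function names are made up, and the terminator address is a hypothetical placeholder. Real terminator addresses would have to be resolved from the hosts listed in the pybal *-https files for the time the request was processed.

```python
import ipaddress

# Assumption: CIDR equivalents of the prefix patterns in footnote [2].
INTERNAL_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "91.198.174.0/24",
        "208.80.152.0/22",
        "198.35.26.0/23",
        "185.15.56.0/22",
        "10.0.0.0/24",
        "2620:0:860::/46",
        "2a02:ec80::/32",
    )
]


def is_internal(ip):
    """True if the address falls into one of the internal ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in INTERNAL_NETWORKS)


def keep_request(ip, ssl_terminator_ips):
    """Keep external requests and requests relayed by SSL terminators."""
    if not is_internal(ip):
        return True
    return ipaddress.ip_address(ip) in ssl_terminator_ips


# Hypothetical terminator address, for illustration only.
TERMINATORS = {ipaddress.ip_address("10.0.0.36")}

print(keep_request("203.0.113.7", TERMINATORS))  # external address
print(keep_request("10.0.0.36", TERMINATORS))    # relayed by a terminator
print(keep_request("10.0.0.99", TERMINATORS))    # other internal address
```

The two caveats still apply to this sketch: HTTPS requests made from internal machines are kept, and HTTP requests from labs instances are dropped.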