Hi Oliver,
On Mon, Dec 15, 2014 at 07:40:47AM -0500, Oliver Keyes wrote:
On 15 December 2014 at 07:35, Christian Aistleitner < christian@quelltextlich.at> wrote:
On Thu, Dec 11, 2014 at 06:27:02PM -0500, Oliver Keyes wrote:
[ Fundraising is doing many internal requests ]
Please provide an example [5] of such traffic, so we're all on the same page.
It's hard to pull out, but they're requests with a PhantomJS user-agent that hit a large number of places to test banners.
Please let's stop being vague, and share your knowledge with us. Show real examples of what you are referring to.
If I look at /a/squid/archive/sampled/sampled-1000.tsv.log-20141215.gz on stat1002, I find 7689325 log lines [1]. Of those log lines, 354335 are from internal IPs [2]. Of those log lines from internal IPs, only 95 came with a PhantomJS User-Agent [3]. Even if all of those 95 are from Fundraising, that's only 0.027% of internal traffic, and 0.0012% of overall traffic.
That's less than I'd consider as:
a lot of automatically-generated traffic
So obviously I am getting it wrong. Please, help me understand what you are referring to. Please, provide real examples.
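As a quick cross-check of the percentages above, here is a minimal Python sketch that recomputes them from the three counts in [1], [2], and [3]. The variable names are mine and only illustrative.

```python
# Counts taken from footnotes [1], [2], and [3].
total_lines = 7689325             # all lines in the sampled log
internal_lines = 354335           # lines from internal IPs
phantomjs_internal_lines = 95     # internal lines with a PhantomJS User-Agent

# Express the PhantomJS lines as a percentage of each population.
share_of_internal = 100 * phantomjs_internal_lines / internal_lines
share_of_total = 100 * phantomjs_internal_lines / total_lines

print(f"{share_of_internal:.3f}% of internal traffic")
print(f"{share_of_total:.4f}% of overall traffic")
```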
FWIW, I was informed that they were doing this...by Fundraising ;p. If you'd like more confirmation than that, I can talk to them and grab a specific example.
That's fine and actually that's pretty great that Fundraising informed you about this. Kudos to Fundraising!
However, doing analysis based on word-of-mouth is bound to fail. We can only analyze what's in our logs.
But it might be that our logs are wrong. If Fundraising really is making a large number of internal requests (how many?), we may not be logging those requests properly, because they don't show up as significant in our logs.
So, if we wanted to exclude internally-generated traffic most of the time, without unduly punishing HTTPS traffic, we'd be looking at a heuristic that looks something like:
*If the request comes from a WMF IP range;
**Exclude, unless;
***The request is to a host listed as https=1 in the pybal file
(Nit-picking on the last step in [4])
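To make the heuristic concrete, here is a rough Python sketch that uses the reading of the last step from [4]. The IP ranges are my transcription of the prefixes in the grep pattern from [2] into CIDR notation. The function names and the SSL terminator address are hypothetical; in practice the terminator list would have to be derived from pybal.

```python
import ipaddress

# Assumption: CIDR transcription of the prefixes in the grep pattern of [2].
WMF_NETWORKS = [ipaddress.ip_network(net) for net in (
    "91.198.174.0/24",
    "208.80.152.0/22",
    "198.35.26.0/23",
    "185.15.56.0/22",
    "10.0.0.0/24",
    "2620:0:860::/46",
    "2a02:ec80::/32",
)]


def is_wmf_ip(ip):
    """Return True if ip falls into one of the WMF ranges above."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in WMF_NETWORKS)


def keep_request(ip, ssl_terminators):
    """Return True if a request logged with this IP should be counted."""
    if not is_wmf_ip(ip):
        return True  # external traffic is always kept
    # Internal traffic is excluded, unless it comes from an SSL terminator.
    return ip in ssl_terminators


# Hypothetical terminator address, for illustration only.
terminators = {"208.80.154.133"}

print(keep_request("203.0.113.7", terminators))     # external client: True
print(keep_request("208.80.154.10", terminators))   # internal, no terminator: False
print(keep_request("208.80.154.133", terminators))  # SSL terminator: True
```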
The heuristic does not address two settings:
* It would count HTTPS traffic from internal nodes
* It would throw away labs' HTTP traffic
They are both low in volume, so I am not sure whether you want to care about them. But as they've been mentioned by others at some point in previous discussions, I am calling them out nonetheless.
* It would count HTTPS traffic from internal nodes
If for example I request
https://en.wikipedia.org/wiki/Foo
on stat1002, the IP in the logs would be the SSL terminator, so it would not get excluded, although the request originated from an internal machine. (X-Forwarded-For FTW!)
* It would throw away labs' HTTP traffic
When requesting
http://en.wikipedia.org/wiki/Foo
from a labs instance, the request is from an internal IP and the IP is not an SSL terminator. So the request would get thrown away.
But at some point you said that labs traffic should not get discarded immediately.
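For the first of the two gaps, the X-Forwarded-For remark could be sketched roughly as follows in Python. This assumes that the log line carries an X-Forwarded-For value and that the SSL terminator appends the address it received the connection from. I have not checked either assumption against our setup. The function name and all addresses are hypothetical.

```python
def originating_ip(remote_ip, x_forwarded_for, ssl_terminators):
    """Best-effort guess at the address a request originally came from.

    Assumption: an SSL terminator appends the address of whoever connected
    to it, so the last X-Forwarded-For entry is the one the terminator saw.
    """
    if remote_ip in ssl_terminators and x_forwarded_for not in ("", "-"):
        return x_forwarded_for.split(",")[-1].strip()
    # Not a terminator, or no usable header: fall back to the logged IP.
    return remote_ip


# Hypothetical addresses, for illustration only.
terminators = {"208.80.154.133"}

# HTTPS request from an internal machine, logged with the terminator's IP.
print(originating_ip("208.80.154.133", "10.0.0.17", terminators))

# Plain HTTP request without a usable header: the logged IP is used as-is.
print(originating_ip("203.0.113.7", "-", terminators))
```

The returned address could then be run through the IP range check instead of the logged IP, so that HTTPS requests from internal machines are no longer counted by accident.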
Have fun, Christian
[1]
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 17:11:44 // exit code: 0
cwd: ~
zcat /a/squid/archive/sampled/sampled-1000.tsv.log-20141215.gz | wc -l
7689325
[2]
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 17:12:16 // exit code: 0
cwd: ~
zcat /a/squid/archive/sampled/sampled-1000.tsv.log-20141215.gz | cut -f 5 | grep -c '^(91.198.174.|208.80.15[2345].|198.35.2[67].|185.15.5[6789].|10.0.0.|2620:0:86[0123]:|2a02:ec80:)'
354335
[3]
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 17:13:01 // exit code: 0
cwd: ~
zcat /a/squid/archive/sampled/sampled-1000.tsv.log-20141215.gz | cut -f 5,14 | grep -c '^(91.198.174.|208.80.15[2345].|198.35.2[67].|185.15.5[6789].|10.0.0.|2620:0:86[0123]:|2a02:ec80:).*PhantomJS'
95
[4] Assuming the third line was meant to read something along the lines of
***The request is from a host listed (with positive weight, enabled, and not commented out) in a *https file in pybal while the request got processed
as there is no https=1 in pybal files. And requests /to/ SSL terminators do not make it into Hive. Only requests /from/ the SSL terminators do.