Hi Oliver,
On Mon, Dec 15, 2014 at 07:40:47AM -0500, Oliver Keyes wrote:
On 15 December 2014 at 07:35, Christian Aistleitner <christian(a)quelltextlich.at> wrote:
On Thu, Dec 11, 2014 at 06:27:02PM -0500, Oliver Keyes wrote:
[ Fundraising is doing many internal requests ]
Please provide an example [5] of such traffic, so we're all on the
same page.
It's hard to pull out, but they're requests with a PhantomJS user-agent
that hit a large number of places to test banners.
Please let's stop being vague, and share your knowledge with us.
Show real examples of what you are referring to.
If I look at
/a/squid/archive/sampled/sampled-1000.tsv.log-20141215.gz
on stat1002, I find 7689325 log lines [1].
Of those log lines, 354335 are from internal IPs [2].
Of those log lines from internal IPs, only 95 came with a PhantomJS
User-Agent [3].
Even if all of those 95 are from Fundraising, that's only 0.027% of
internal traffic, and 0.0012% of overall traffic.
That's less than I'd consider as:
> > a lot of
> > automatically-generated traffic
So obviously I am getting it wrong.
Please, help me understand what you are referring to.
Please, provide real examples.
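For anyone who wants to re-check the two percentages above, here is a minimal sketch that recomputes them from the counts in [1], [2], and [3]. The `percent` helper is a name made up for this illustration; it is not part of any existing tooling.

```shell
# Hypothetical helper: print PART as a percentage of TOTAL, 4 decimal places.
percent() {
    LC_ALL=C awk -v part="$1" -v total="$2" \
        'BEGIN { printf "%.4f\n", 100 * part / total }'
}

# Counts taken from [1], [2], and [3].
percent 95 354335     # PhantomJS share of internal-IP log lines
percent 95 7689325    # PhantomJS share of all sampled log lines
```

The first call gives roughly 0.027 and the second roughly 0.0012, in line with the figures quoted above.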
FWIW, I was informed that they were doing this...by Fundraising ;p.
If you'd like more confirmation than that, I can talk to them and
grab a specific example.
That's fine and actually that's pretty great that Fundraising informed
you about this. Kudos to Fundraising!
However, doing analysis based on word-of-mouth is bound to fail.
We can only analyze what's in our logs.
But it might be that our logs are wrong.
If Fundraising really is making a great many internal requests (how
many?), we may not be logging those requests properly, because they
don't show up as significant in our logs.
So, if we wanted to exclude internally-generated traffic most of the
time, without unduly punishing HTTPS traffic, we'd be looking at a
heuristic that looks something like:
*If the request comes from a WMF IP range;
**Exclude, unless;
***The request is to a host listed as https=1 in the pybal file
(Nit-picking on the last step in [4])
The heuristic does not address two settings:
* It would count HTTPS traffic from internal nodes
* It would throw away labs' HTTP traffic
They are both low in volume, so I am not sure whether you want to care
about them. But as they've been mentioned by others at some point in
previous discussions, I am calling them out nonetheless.
* It would count HTTPS traffic from internal nodes
If for example I request
https://en.wikipedia.org/wiki/Foo
on stat1002, the IP in the logs would be the SSL terminator, so it
would not get excluded, although the request originated from an
internal machine. (X-Forwarded-For FTW!)
* It would throw away labs' HTTP traffic
When requesting
http://en.wikipedia.org/wiki/Foo
from a labs instance, the request is from an internal IP and the IP is
not an SSL terminator. So the request would get thrown away.
But at some point you said that labs traffic should not get discarded
immediately.
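To make the two gaps concrete, here is a rough sketch of the heuristic as a shell function. Assumptions: the internal ranges are the ones from the grep pattern in [2], rewritten as an extended regex; `SSL_TERMINATORS`, `is_internal`, `classify`, and all IP addresses below are made up for illustration and are not real host assignments.

```shell
# Internal ranges as in [2], rewritten for grep -E.
INTERNAL_RE='^(91\.198\.174\.|208\.80\.15[2345]\.|198\.35\.2[67]\.|185\.15\.5[6789]\.|10\.0\.0\.|2620:0:86[0123]:|2a02:ec80:)'

# Hypothetical list of SSL terminator IPs (placeholder values).
SSL_TERMINATORS='208.80.154.133 208.80.154.134'

is_internal() {
    printf '%s\n' "$1" | grep -Eq "$INTERNAL_RE"
}

# Print "keep" or "exclude" for the client IP of one log line.
classify() {
    if ! is_internal "$1"; then
        echo keep
        return
    fi
    case " $SSL_TERMINATORS " in
        *" $1 "*) echo keep ;;     # internal, but an SSL terminator
        *)        echo exclude ;;  # any other internal IP
    esac
}

classify 203.0.113.7       # external client: keep
classify 208.80.154.133    # internal HTTPS request, logged with the terminator's IP: keep
classify 208.80.155.10     # HTTP request from a (hypothetical) labs IP: exclude
```

The second and third calls are the two cases listed above: the heuristic only sees the IP in the log line, so it cannot tell where a request behind a terminator came from, and it treats labs like any other internal host.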
Have fun,
Christian
[1]
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 17:11:44 // exit code: 0
cwd: ~
zcat /a/squid/archive/sampled/sampled-1000.tsv.log-20141215.gz | wc -l
7689325
[2]
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 17:12:16 // exit code: 0
cwd: ~
zcat /a/squid/archive/sampled/sampled-1000.tsv.log-20141215.gz | cut -f 5 | grep -c '^\(91\.198\.174\.\|208\.80\.15[2345]\.\|198\.35\.2[67]\.\|185\.15\.5[6789]\.\|10\.0\.0\.\|2620:0:86[0123]:\|2a02:ec80:\)'
354335
[3]
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 17:13:01 // exit code: 0
cwd: ~
zcat /a/squid/archive/sampled/sampled-1000.tsv.log-20141215.gz | cut -f 5,14 | grep -c '^\(91\.198\.174\.\|208\.80\.15[2345]\.\|198\.35\.2[67]\.\|185\.15\.5[6789]\.\|10\.0\.0\.\|2620:0:86[0123]:\|2a02:ec80:\).*PhantomJS'
95
[4] Assuming the third line was meant to read something along the lines of
***The request is from a host listed (with positive weight,
enabled, and not commented out) in a *https file in pybal while
the request got processed
as there is no https=1 in pybal files. And requests /to/ SSL
terminators do not make it into Hive. Only requests /from/ the SSL
terminators do.
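A rough sketch of that filtering step, under a loud assumption: it assumes the pybal files hold one Python-dict-style line per host with `host`, `weight`, and `enabled` keys. The `active_hosts` name and the hostnames are invented for illustration, and the sketch only looks at one snapshot of the file, not at the file's state while a given request got processed.

```shell
# Hypothetical filter: read a pybal-style host file on stdin and print
# the hosts that are not commented out, enabled, and positively weighted.
# ASSUMPTION: one dict-style line per host, as in the sample below.
active_hosts() {
    grep -v '^[[:space:]]*#' |
        grep -E "'enabled':[[:space:]]*True" |
        grep -E "'weight':[[:space:]]*[1-9]" |
        sed -E "s/.*'host':[[:space:]]*'([^']+)'.*/\1/"
}

# Made-up sample input in the assumed format.
active_hosts <<'EOF'
{ 'host': 'ssl1.example.org', 'weight': 10, 'enabled': True }
{ 'host': 'ssl2.example.org', 'weight': 0, 'enabled': True }
{ 'host': 'ssl3.example.org', 'weight': 10, 'enabled': False }
#{ 'host': 'ssl4.example.org', 'weight': 10, 'enabled': True }
EOF
```

With the sample input, only the first host passes all three checks.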
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------