On 15 December 2014 at 07:35, Christian Aistleitner <christian@quelltextlich.at> wrote:
Hi,

On Thu, Dec 11, 2014 at 06:27:02PM -0500, Oliver Keyes wrote:
> > On Sun, Dec 07, 2014 at 12:59:27PM +0100, Christian Aistleitner wrote:
> > >   http://config-master.wikimedia.org/pybal/esams/text-https
> [...]
> I'm not sure how to interpret the pybal,

The example file linked above holds lines like

  { 'host': 'amssq36.esams.wmnet', 'weight': 1, 'enabled': True }

Such a line means:

  The host 'amssq36.esams.wmnet' [1] is [2] an SSL terminator for the
  text cluster in esams [3], and has weight 1 [4].
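
As a rough illustration of that reading, here is a minimal sketch that
pulls the three fields out of such a line. It assumes each line is a
plain Python dict literal, as in the sample above; the variable names
are only for illustration.

```python
import ast

# One line as it appears in the linked pybal file.
line = "  { 'host': 'amssq36.esams.wmnet', 'weight': 1, 'enabled': True }"

# literal_eval only accepts literals, so it is safe for untrusted text.
entry = ast.literal_eval(line.strip())

host = entry['host']        # footnote [1]
enabled = entry['enabled']  # footnote [2]
weight = entry['weight']    # footnote [4]

print(host, enabled, weight)
```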



> Essentially; we want
> to be excluding internal IP spaces, because that contains a lot of
> automatically-generated traffic (fundraising, I'm looking at you)

Oliver, I do not like blame games.
You blamed Fundraising before for causing lots of internal requests,
and I asked you back then to please provide an example.
However, you did not provide one. And yet you call out
Fundraising again.

Please provide an example [5] of such traffic, so we're all on the
same page.

It's hard to pull out, but they're requests with a PhantomJS user-agent that hit a large number of places to test banners. To be clear (my initial email was not), this is not a serious "damn you fundraising! Damn you all to heck!" but a joking one ;p. They do fantastic work, and the requests they make to test banner appearance are part of that work. FWIW, I was informed that they were doing this...by Fundraising ;p. If you'd like more confirmation than that, I can talk to them and grab a specific example.
 



> So, we
> exclude all requests from IPs within our ranges. Except, then we also
> exclude all the SSL traffic, since that will appear to come from an
> internal IP address, from the point of view of the request logs.
>
> So, do I interpret this pybal as: if it's tagged as HTTPS,

Since you use 'tag' in different contexts around https, let me clarify
how I read 'tag' here. I read it as “If a pybal *-https file lists a
host as enabled with positive weight in a line that is not commented
out”



> it's an SSL
> terminator, [...]

Yes.
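
A minimal sketch of that reading of 'tagged', assuming that
commented-out lines start with '#' and that every other non-blank line
is a Python dict literal. The function name and the '.invalid' hosts
are hypothetical and only there to show the three exclusion cases.

```python
import ast

def active_hosts(pybal_text):
    """Return hosts that are enabled, have positive weight,
    and sit on a line that is not commented out."""
    hosts = set()
    for line in pybal_text.splitlines():
        line = line.strip()
        # Assumption: a leading '#' marks a commented-out line.
        if not line or line.startswith('#'):
            continue
        entry = ast.literal_eval(line)
        if entry.get('enabled') and entry.get('weight', 0) > 0:
            hosts.add(entry['host'])
    return hosts

# Only the first host is real (taken from the sample line);
# the others are made up for illustration.
sample = """
{ 'host': 'amssq36.esams.wmnet', 'weight': 1, 'enabled': True }
#{ 'host': 'commented-out.invalid', 'weight': 1, 'enabled': True }
{ 'host': 'disabled.invalid', 'weight': 1, 'enabled': False }
{ 'host': 'zero-weight.invalid', 'weight': 0, 'enabled': True }
"""

print(active_hosts(sample))
```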



> [...] and so requests from those machines, from internal IP
> addresses, should be included?

In the end “should be included” is something you have to decide.

But if you see a request whose ip column holds the address of a
machine whose name was listed in a pybal *-https file at the time the
request was processed, it “typically” is a relayed request from the
SSL terminator.

(Note the distinction between my “typically is a relayed request from
the SSL terminator” and your “should be included”.)

Awesome :). We'll never get certainty - getting "most of the time" is, I think, Good Enough (tm).
 


> Or: those are the SSL machines, find out
> their IP addresses and you find out the internal IPs that represent SSLd
> requests, rather than internally-generated traffic?

I cannot fully parse that sentence.
But it sounds a bit like SSL traffic would not be internally-generated
traffic.
From the logging perspective, SSL traffic is internally-generated traffic:

  The SSL terminator performs a separate, genuinely fresh and new
  request to the caches.

This separate, genuinely fresh and new request gets logged. And that's
the log line you're after, if you want to look at https traffic from
within Hive.


Gotcha. So, if we wanted to exclude internally-generated traffic most of the time, without unduly punishing HTTPS traffic, we'd be looking at a heuristic that looks something like:

*If the request comes from a WMF IP range;
**Exclude, unless;
***The request is to a host listed as https=1 in the pybal file

If I'm reading right?
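
Roughly, as a sketch: this reads "listed as https=1" as "the request's ip column belongs to a host that a pybal *-https file lists as enabled", which is how it was described earlier in the thread. The network range, the terminator address and the function name below are all made up for illustration; they are not the real WMF ranges or hosts.

```python
import ipaddress

# Hypothetical values, for illustration only.
INTERNAL_NETWORKS = [ipaddress.ip_network('10.0.0.0/8')]
SSL_TERMINATOR_IPS = {ipaddress.ip_address('10.20.0.36')}

def keep_request(client_ip):
    """Return True if a request should stay in the dataset."""
    ip = ipaddress.ip_address(client_ip)
    internal = any(ip in net for net in INTERNAL_NETWORKS)
    if not internal:
        # External traffic is always kept.
        return True
    # Internal traffic is excluded, unless it was relayed
    # by a known SSL terminator.
    return ip in SSL_TERMINATOR_IPS

print(keep_request('203.0.113.7'))  # external client
print(keep_request('10.20.0.36'))   # (hypothetical) SSL terminator
print(keep_request('10.1.2.3'))     # other internal host
```

Since the pybal files change over time, the terminator set would really need to be the one that was valid while the request was processed, so this only gets us "most of the time".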
 


Have fun,
Christian



[1] 'host' field

[2] 'enabled' field

[3] see URL

[4] 'weight' field. You probably need not care about the weight. The
weight tells you how much of the overall traffic a node gets. In the
given file, all hosts have weight 1, so they all get a similar sized
part of the overall traffic.

[5] Either anonymized on-list, or else for example through a command
that we can run on stat1002.



--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
                           Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3     Email:  christian@quelltextlich.at
4293 Gutau, Austria          Phone:          +43 7946 / 20 5 81
                             Fax:            +43 7946 / 20 5 81
                             Homepage: http://quelltextlich.at/
---------------------------------------------------------------

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics



--
Oliver Keyes
Research Analyst
Wikimedia Foundation