On 15 December 2014 at 07:35, Christian Aistleitner <christian@quelltextlich.at> wrote:
Hi,
On Thu, Dec 11, 2014 at 06:27:02PM -0500, Oliver Keyes wrote:
On Sun, Dec 07, 2014 at 12:59:27PM +0100, Christian Aistleitner wrote:
[...] I'm not sure how to interpret the pybal,
The example file linked above holds lines like
{ 'host': 'amssq36.esams.wmnet', 'weight': 1, 'enabled': True }
Such a line means:
The host 'amssq36.esams.wmnet' [1] is [2] an SSL terminator for the text cluster in esams [3], and has weight 1 [4].
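As a rough illustration of the reading above: a line of this shape is a Python dict literal, so it can be parsed with `ast.literal_eval`. This is only a sketch. The function name `parse_pybal_line` is hypothetical, and treating lines that start with `#` as commented out is an assumption about the file format.

```python
import ast


def parse_pybal_line(line):
    """Parse one pybal line into a dict.

    Returns None for blank lines and for lines assumed to be
    commented out (leading '#').
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # literal_eval only evaluates Python literals, never arbitrary code.
    entry = ast.literal_eval(stripped)
    if not isinstance(entry, dict):
        raise ValueError("expected a dict literal, got %r" % (entry,))
    return entry


entry = parse_pybal_line(
    "{ 'host': 'amssq36.esams.wmnet', 'weight': 1, 'enabled': True }"
)
# The three fields map to footnotes [1], [2] and [4].
host, enabled, weight = entry["host"], entry["enabled"], entry["weight"]
```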
Essentially, we want to exclude internal IP spaces, because they contain a lot of automatically generated traffic (fundraising, I'm looking at you)
Oliver, I do not like blame games. You blamed Fundraising before for causing lots of internal requests, and I asked you then to please provide an example. However, you did not provide one. And yet you call out Fundraising again.
Please provide an example [5] of such traffic, so we're all on the same page.
It's hard to pull out, but they're requests with a PhantomJS user-agent that hit a large number of places to test banners. To be clear (my initial email was not clear) this is not a serious "damn you fundraising! Damn you all to heck!" but a joking one ;p. They do fantastic work and the requests they make to test banner appearance are part of that work. FWIW, I was informed that they were doing this...by Fundraising ;p. If you'd like more confirmation than that, I can talk to them and grab a specific example.
So, we exclude all requests from IPs within our ranges. Except, then we also exclude all the SSL traffic, since that will appear to come from an internal IP address, from the point of view of the request logs.
So, do I interpret this pybal as: if it's tagged as HTTPS,
Since you use 'tag' in different contexts around https, let me clarify how I read 'tag' here. I read it as “If a pybal *-https file lists a host as enabled with positive weight in a line that is not commented out”
it's an SSL terminator, [...]
Yes.
[...] and so requests from those machines, from internal IP addresses, should be included?
In the end “should be included” is something you have to decide.
But if you see a request, whose ip column comes from a machine whose corresponding name has been listed in a pybal *-https file while the request was processed, it “typically” is a relayed request from the SSL terminator.
(Note the distinction between my “typically is a relayed request from the SSL terminator” and your “should be included”.)
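The reading of 'tag' given above (listed as enabled, with positive weight, on a line that is not commented out) could be sketched as follows. This is an illustration under assumptions, not an existing tool: `enabled_terminators` is a hypothetical name, `#` is assumed to mark a commented-out line, and every sample host other than amssq36.esams.wmnet is made up. The sketch also ignores the "while the request was processed" part, which would need the file's history.

```python
import ast


def enabled_terminators(lines):
    """Collect hosts listed as enabled with positive weight.

    Blank lines and lines assumed to be commented out ('#') are skipped.
    """
    hosts = set()
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        entry = ast.literal_eval(stripped)
        if entry.get("enabled") and entry.get("weight", 0) > 0:
            hosts.add(entry["host"])
    return hosts


# Made-up sample content of a pybal *-https file.
SAMPLE = [
    "{ 'host': 'amssq36.esams.wmnet', 'weight': 1, 'enabled': True }",
    "{ 'host': 'disabled.example', 'weight': 1, 'enabled': False }",
    "#{ 'host': 'commented.example', 'weight': 1, 'enabled': True }",
    "{ 'host': 'zero.example', 'weight': 0, 'enabled': True }",
]

terminators = enabled_terminators(SAMPLE)
```

A request whose ip column resolves to a name in `terminators` would then "typically" be a relayed request, which is a weaker statement than "should be included".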
Awesome :). We'll never get certainty - getting "most of the time" is, I think, Good Enough (tm).
Or: those are the SSL machines, find out their IP addresses and you find out the internal IPs that represent SSLd requests, rather than internally-generated traffic?
I cannot fully parse that sentence. But it sounds a bit like SSL traffic would not be internally-generated traffic. From the logging perspective, SSL traffic is internally-generated traffic:
The SSL terminator performs a separate, genuinely fresh and new request to the caches.
This separate, genuinely fresh and new request gets logged. And that's the log line you're after, if you want to look at https traffic from within Hive.
Gotcha. So, if we wanted to exclude internally-generated traffic most of the time, without unduly punishing HTTPS traffic, we'd be looking at a heuristic that looks something like:
* If the request comes from a WMF IP range;
** Exclude, unless;
*** The request is to a host listed as https=1 in the pybal file
If I'm reading right?
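One way to write that heuristic down, as a sketch only: the network range and addresses below are placeholders, not actual WMF ranges or terminator IPs, and `keep_request` is a hypothetical name. Following the description earlier in the thread, the sketch matches on the address in the request's ip column (the machine the logged request came from) and assumes the terminator host names have already been resolved to IP addresses.

```python
import ipaddress


def keep_request(client_ip, internal_networks, terminator_ips):
    """Return True if the request should be kept.

    External requests are kept. Internal requests are excluded unless
    they come from a known SSL terminator.
    """
    ip = ipaddress.ip_address(client_ip)
    is_internal = any(ip in net for net in internal_networks)
    if not is_internal:
        return True
    return ip in terminator_ips


# Placeholder values for illustration only.
INTERNAL = [ipaddress.ip_network("10.0.0.0/8")]
TERMINATORS = {ipaddress.ip_address("10.20.0.36")}

external_kept = keep_request("198.51.100.7", INTERNAL, TERMINATORS)
terminator_kept = keep_request("10.20.0.36", INTERNAL, TERMINATORS)
other_internal_kept = keep_request("10.1.2.3", INTERNAL, TERMINATORS)
```

As noted above, this only gets "most of the time" right: it cannot tell a relayed HTTPS request apart from a request that a terminator host generated on its own.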
Have fun, Christian
[1] 'host' field
[2] 'enabled' field
[3] see URL
[4] 'weight' field. You probably need not care about the weight. The weight tells you how much of the overall traffic a node gets. In the given file, all hosts have weight 1, so they all get a similar sized part of the overall traffic.
[5] Either anonymized on-list, or else for example through a command that we can run on stat1002.
--
---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3
4293 Gutau, Austria
Email: christian@quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics