For virtual pageviews, people will probably be more
reports that belong to the first group (summing them up with normal
pageviews, breaking them down along the dimensions that are relevant for
web traffic, counting them for a given URL etc).
Ah! Ok I get this use case now. I might not be able to comment about this
much then. I think this totally changes the meaning of a pageview.
Perhaps this is what you want? If so, this is outside the realm of my
However, IF you do convince folks to change the meaning of ‘pageview’ to
include ‘previews’, then we might be able to compromise. All I object to is
more filtering of webrequests :) The rest of this email might be moot if
we don’t change the ‘pageview definition’, but I’ll continue anyway…
The page previews data could come in as events. Augmenting the generated
pageviews table from more incoming event sources sounds more flexible than
doing more filtering logic in webrequests. I’d defer to the Analytics team
members who would be implementing this though, I might be wrong.
In my ideal, pageviews and page_previews would both be separate event
streams. These would be imported as is to Hive tables, but also available
in Kafka. You could join these together in a broader ‘content consumption’
dataset somehow, either in Hadoop with batch jobs, or more realtime with
streaming jobs. (If this is done right, you can even use the same code for
both cases.) If we had a good stream processing system here, I might
suggest that we move pageview filtering to a more realtime setup and
generate a derived pageview stream in Kafka. We’d then use that as the source
of pageviews in Hadoop. Anyway, this is my ideal setup, but not what we
have now! But we might one day (in the next FY???), and intaking events
for page previews and other counters will help us migrate to this kind
of architecture later.
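
To make the "same code for both cases" idea concrete, here is a minimal
sketch in Python. It assumes both streams arrive as timestamp-ordered
iterables of dicts; the field names (ts, page, kind) and the function names
are hypothetical and not taken from any existing schema or job. Because it
only consumes iterators, the same functions could be fed from a batch read
or from a live consumer.

```python
import heapq
from typing import Dict, Iterable, Iterator, Tuple

Event = Dict[str, object]


def tag(events: Iterable[Event], kind: str) -> Iterator[Event]:
    """Label each event with the stream it came from (hypothetical 'kind' field)."""
    for event in events:
        yield {**event, "kind": kind}


def content_consumption(
    pageviews: Iterable[Event], page_previews: Iterable[Event]
) -> Iterator[Event]:
    """Lazily merge two timestamp-ordered streams into one ordered stream."""
    return heapq.merge(
        tag(pageviews, "pageview"),
        tag(page_previews, "page_preview"),
        key=lambda event: event["ts"],
    )


def counts_by_page(events: Iterable[Event]) -> Dict[Tuple[object, object], int]:
    """Count events per (page, kind) pair."""
    counts: Dict[Tuple[object, object], int] = {}
    for event in events:
        key = (event["page"], event["kind"])
        counts[key] = counts.get(key, 0) + 1
    return counts
```

The merge keeps pageviews and previews distinguishable by kind, so a report
can sum them together or break them apart without changing what a pageview
means upstream.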
Is that different from preprocessing them via EventLogging? Either way
you take an HTTP request, and end up with a Hadoop record -
something that makes that process a lot more costly for normal pageviews
than EventLogging beacon hits?
From a hardware perspective, only in that the stream of events is much
smaller, so there’s less wasted repeated I/O. From an engineering
perspective, if we use the webrequest tagging system to do this, I think
we’re good, but only in the short term. In the long term, it hides the
complexity involved in maintaining the logic of what a pageview or page
preview or any other ‘tagged’ webrequest is in complicated Java logic that is
really only useable in Hadoop. I’m mainly objecting because we want to
draw a line to stop doing this kind of thing. Doing this for page previews
now might be ok if we really really really have to (although Nuria might
not agree ;) ), but ultimately we need to push this kind of interaction
logic out to feature developers who have more control over it.
The Analytics team wants to build infrastructure that makes it easy for
developers to measure their product usage, not implement the measuring
On Fri, Jan 19, 2018 at 6:05 AM, Adam Baso <abaso(a)wikimedia.org> wrote:
Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
library would some sort of new method be needed so that these impressions
On Fri, Jan 19, 2018 at 4:49 AM, Sam Smith <samsmith(a)wikimedia.org> wrote:
On Thu, Jan 18, 2018 at 9:57 PM, Adam Baso
Adding to this, one thing to consider is DNT - is there a way to invoke
EL so that such traffic is appropriately imputed or something?
The EventLogging client respects DNT. When the user enables DNT,
mw.eventLog.logEvent is a NOP.
I don't see any mention of DNT in the Varnish VCLs around the /beacon
endpoint or otherwise but it may be handled elsewhere. While it's unlikely,
there's nothing stopping a client sending a well-formatted request to the
/beacon/event endpoint directly, ignoring the user's choice.
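
For illustration only, a server-side check could look roughly like the
Python sketch below. This is an assumption about what such a filter might
do, not a description of the Varnish configuration or of any existing
EventLogging code; the function names are hypothetical. It only helps when
the request actually carries the DNT header, so a client that strips the
header would still get through.

```python
from typing import Mapping


def dnt_enabled(headers: Mapping[str, str]) -> bool:
    """Return True if the request carries a 'DNT: 1' header (name is case-insensitive)."""
    for name, value in headers.items():
        if name.lower() == "dnt":
            return value.strip() == "1"
    return False


def should_record_beacon(headers: Mapping[str, str]) -> bool:
    """Hypothetical policy: drop beacon hits from requests that opted out via DNT."""
    return not dnt_enabled(headers)
```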
Analytics mailing list