> For virtual pageviews, people will probably be more interested in reports that belong to the first group (summing them up with normal pageviews, breaking them down along the dimensions that are relevant for web traffic, counting them for a given URL etc).
Ah! Ok I get this use case now. I might not be able to comment about this much then. I think this totally changes the meaning of a pageview. Perhaps this is what you want? If so, this is outside the realm of my opinionatedness. :)
However, IF you do convince folks to change the meaning of ‘pageview’ to include ‘previews’, then we might be able to compromise. All I object to is more filtering of webrequests :) The rest of this email might be moot if we don’t change the ‘pageview definition’, but I’ll continue anyway…
The page previews data could come in as events. Augmenting the generated pageviews table from more incoming event sources sounds more flexible than doing more filtering logic in webrequests. I’d defer to the Analytics team members who would be implementing this though, I might be wrong.
In my ideal, pageviews and page_previews would both be separate event streams. These would be imported as-is to Hive tables, but also be available in Kafka. You could join these together in a broader ‘content consumption’ dataset somehow, either in Hadoop with batch jobs, or more realtime with streaming jobs. (If this is done right, you can even use the same code for both cases.) If we had a good stream processing system here, I might suggest that we move pageview filtering to a more realtime setup and generate a derived pageview stream in Kafka. We’d then use that as the source of pageviews in Hadoop. Anyway, this is my ideal setup, but not what we have now! But we might one day (in the next FY???), and intaking events for page previews and other counters will help us migrate to this kind of architecture later.
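To make the "same code for both cases" idea a bit more concrete, here is a rough Python sketch. Everything in it is made up for illustration: the field names (`stream`, `page`, `kind`) and the function names are assumptions, not our actual schemas or jobs. The point is only that the combining logic takes any iterable of events, so it doesn't care whether it is fed a finished batch or a live stream.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: {"stream": "pageview" | "page_preview", "page": "<title>"}
Event = Dict[str, str]


def to_consumption(events: Iterable[Event]) -> Iterator[Event]:
    """Normalize events from either stream into one 'content consumption' record shape."""
    for event in events:
        if event["stream"] not in ("pageview", "page_preview"):
            continue  # ignore streams this dataset does not cover
        yield {"page": event["page"], "kind": event["stream"]}


def count_by_page(records: Iterable[Event]) -> Counter:
    """Count records per (page, kind); accepts a list or a lazy iterator."""
    counts: Counter = Counter()
    for record in records:
        key: Tuple[str, str] = (record["page"], record["kind"])
        counts[key] += 1
    return counts


# "Batch" input: a finished list, as if read from a table.
batch = [
    {"stream": "pageview", "page": "Cat"},
    {"stream": "page_preview", "page": "Cat"},
    {"stream": "pageview", "page": "Cat"},
    {"stream": "page_preview", "page": "Dog"},
    {"stream": "search", "page": "Dog"},  # not part of content consumption
]


def live() -> Iterator[Event]:
    """'Streaming' input: the same events, delivered one at a time."""
    yield from batch


batch_counts = count_by_page(to_consumption(batch))
stream_counts = count_by_page(to_consumption(live()))
```

A real streaming job would of course emit windowed counts rather than wait for the stream to end; this only shows that the per-event logic can be shared between the two.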
> Is that different from preprocessing them via EventLogging? Either way you take a HTTP request, and end up with a Hadoop record - is there something that makes that process a lot more costly for normal pageviews than EventLogging beacon hits?
From a hardware perspective, only in that the stream of events is much smaller, so there’s less wasted repeated I/O. From an engineering time perspective, if we use the webrequest tagging system to do this, I think we’re good, but only in the short term. In the long term, it hides the complexity of maintaining the definition of a pageview, a page preview, or any other ‘tagged’ webrequest inside complicated Java logic that is really only usable in Hadoop. I’m mainly objecting because we want to draw a line and stop doing this kind of thing. Doing this for page previews now might be ok if we really really really have to (although Nuria might not agree ;) ), but ultimately we need to push this kind of interaction logic out to feature developers, who have more control over it.
The Analytics team wants to build infrastructure that makes it easy for developers to measure their product usage, not implement the measuring logic ourselves.