On Thu, Jan 18, 2018 at 3:56 PM, Nuria Ruiz <nuria@wikimedia.org> wrote:
Event logging use cases are events; as we move to a thicker, more JavaScript-heavy client, you will need to measure events for nearly everything, and whether those are considered "content consumption" or "ui interaction" is not that relevant. Example: video plays are content consumption and are also "ui interactions".

That could be an argument for not separating pageviews from events (in which case the question of whether virtual pageviews should be more like pageviews or more like events would be moot), but given that those *are* separated, I don't see how it applies. In the current analytics setup, and given what kinds of frontends are currently supported, there are types of report generation that are easier to perform on pageviews and not so easy on events, and other types that are easier to do on events. For virtual pageviews, people will probably be more interested in reports that belong to the first group (summing them up with normal pageviews, breaking them down along the dimensions that are relevant for web traffic, counting them for a given URL, etc.).

On Thu, Jan 18, 2018 at 10:45 AM, Andrew Otto <otto@wikimedia.org> wrote:
the beacon puts the record into the webrequest table and from there it would only take some trivial preprocessing
‘Trivial’ preprocessing that has to look through 150K requests per second! This is a lot of work!

Is that different from preprocessing them via EventLogging? Either way you take an HTTP request and end up with a Hadoop record - is there something that makes that process a lot more costly for normal pageviews than for EventLogging beacon hits?

Anyway, what I meant by trivial preprocessing is that you take something like http://bits.wikimedia.org/beacon/page-preview?duration=123&uri=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FFoo, convert it into https://en.wikipedia.org/wiki/Foo, put the duration and the type ('page-preview') into some extra fields, add those extra fields to the dimensions along which pageviews can be inspected, and you have integrated virtual views into your analytics APIs / UIs almost for free (rough sketch below). The alternative would be that every analytics customer who wants to deal with content consumption, and does not want to silently filter out content consumption happening via thick clients, would have to update their interfaces and do some kind of union query to merge data that's now split between the webrequest table and one or more EventLogging tables; surely that's less expedient?
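
To spell out the per-record work I have in mind, here's a rough Python sketch. The endpoint and parameter names are just the hypothetical ones from the example URL above, not anything the Page Previews client actually sends today, and the real thing would of course live inside the refinery/Hadoop pipeline rather than a standalone function:

    from urllib.parse import urlsplit, parse_qs

    def virtual_view_from_beacon(beacon_url):
        """Turn one beacon hit into a pageview-like record with extra fields."""
        parts = urlsplit(beacon_url)
        params = parse_qs(parts.query)   # parse_qs also URL-decodes the values
        return {
            'uri': params['uri'][0],                     # https://en.wikipedia.org/wiki/Foo
            'view_type': parts.path.rsplit('/', 1)[-1],  # 'page-preview'
            'duration': int(params.get('duration', ['0'])[0]),
        }

That's the extent of the preprocessing; the rest is exposing the two extra fields as dimensions next to the ones pageviews already have.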

If we use webrequests+Hadoop tagging to count these, any time in the future there is a change to the URLs that page previews load (or the beacon URLs they hit), we’d have to make a patch to the tagging logic and release and deploy a new refinery version to account for the change.  Any time a new feature is added for which someone wants interactions counted, we have to do the same.

There doesn't seem to be much reason for the beacon URL to ever change. As for new beacon endpoints (new virtual view types), why can't that just be a whitelist that's offloaded to configuration?
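
For example (a purely hypothetical shape for such a whitelist, not an existing refinery setting), adding a new feature's virtual views would be a config change rather than a code change and deploy:

    # Hypothetical config: beacon endpoints that count as virtual views,
    # mapped to the view type they should be tagged with.
    VIRTUAL_VIEW_ENDPOINTS = {
        '/beacon/page-preview': 'page-preview',
        # a future feature's endpoint would just be added here
    }

    def is_virtual_view(uri_path):
        return uri_path in VIRTUAL_VIEW_ENDPOINTS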