Are you saying that the server load generated by such an additional aggregation query would be a blocker? If yes, how about we combine the two (for pageviews and previews) into one?

Sorry, no, it isn’t a blocker. The tagging logic that Nuria and others have been working on for a while now makes this a little easier, since the webrequests only need to be read once to add all tags. It is separate from pageviews (for now), but we might use tagging for pageviews eventually too.

 I assume it could be quite analogous to the one your team has implemented for pageviews
If we did it like the linked Hive query, it would be quite a lot of load. We don’t want to read every webrequest from disk for every aggregate dataset. Tagging helps, since we define the set of tags and filters once, and the job that adds tags reads all webrequests once and adds all tags.
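
As a rough illustration of the single-pass idea (not the actual refinery tagging code), here is a minimal Python sketch. The record fields (`uri_path`, `uri_query`), the tag names, and the predicates are all made up for the example; the real pageview and preview definitions are more involved.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical request record: a plain dict with "uri_path" and "uri_query".
Request = Dict[str, str]

# Tags and their filters are declared once, in one place.
# These predicates are illustrative assumptions, not the real definitions.
TAGGERS: Dict[str, Callable[[Request], bool]] = {
    "pageview": lambda r: r["uri_path"].startswith("/wiki/"),
    "preview": lambda r: r["uri_path"] == "/beacon/event"
    and "preview" in r["uri_query"],
}


def tag_requests(requests: Iterable[Request]) -> Iterator[dict]:
    """Read each request exactly once and attach every matching tag."""
    for req in requests:
        tags: List[str] = [name for name, matches in TAGGERS.items() if matches(req)]
        yield {**req, "tags": tags}


def count_by_tag(tagged: Iterable[dict]) -> Dict[str, int]:
    """Aggregate over the already-tagged records; no second scan of raw requests."""
    counts: Dict[str, int] = {}
    for req in tagged:
        for tag in req["tags"]:
            counts[tag] = counts.get(tag, 0) + 1
    return counts
```

Adding another aggregate dataset then means adding one entry to `TAGGERS`, rather than writing another query that scans every webrequest again.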

But anyway, yes, it can be done.

I’m mostly objecting and recommending EventLogging because we really shouldn’t be searching webrequest over and over again to measure interactions. It's fragile and monolithic and not very portable. Events are better :)



On Thu, Jan 18, 2018 at 6:44 PM, Tilman Bayer <tbayer@wikimedia.org> wrote:
On Thu, Jan 18, 2018 at 10:45 AM, Andrew Otto <otto@wikimedia.org> wrote:
the beacon puts the record into the webrequest table and from there it would only take some trivial preprocessing
‘Trivial’ preprocessing that has to look through 150K requests per second! This is a lot of work!
 
I think Gergo may have been referring to the human work involved in implementing that preprocessing step. I assume it could be quite analogous to the one your team has implemented for pageviews: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/pageview/hourly/pageview_hourly.hql

Are you saying that the server load generated by such an additional aggregation query would be a blocker? If yes, how about we combine the two (for pageviews and previews) into one?


--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics