Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

19 Jan 2018

...
For example, UI instrumentations on the web are almost always sampled,
because that yields enough data to answer UI questions - but on the other
hand tend to record much more detail about the individual interaction. In
contrast, we register all pageviews unsampled, but don't keep a permanent
record of every single one of them with precise timestamps - rather, we
have aggregated tables (pageview_hourly in particular). Our EventLogging
backend is not tailored to that.

When you say “Our EventLogging backend” here, what are you referring to?
If MySQL, then for sure. :)

...
Storing data about seen previews in the same way as we do for pageviews,
for example in the pageview_hourly (suitably tagged, perhaps giving that
table a more general name) would facilitate that a lot, by allowing us to
largely reuse the work that during the past few years went into getting
pageview aggregation right.

I’m not totally opposed to doing it this way, but at some point we need to
realize that this isn’t a scalable (human and CPU resource wise) way to
measure user feature interaction.

I don’t think a pageview is inherently different than any other kind of
impression, it’s just that we didn’t have the ability in the past (or now?)
for pageviews to be collected and measured like they should.  If we were
designing an interaction measurement system now, it wouldn’t look exactly
like EventLogging, but it would look like something close to it.  And if it
did everything I’d want it to, we would use it to measure pageviews and
everything else you’ve mentioned.

Making events be the source of truth is more accurate than implementing
custom batch logic in Hadoop to comb through webrequests and filter out
what you are looking for.  It pushes control of the definition of what
counts as a ‘pageview’ or ‘page preview’ to the folks who are developing
the app/website/feature.  If we use webrequests+Hadoop tagging to count
these, any time in the future there is a change to the URLs that page
previews load (or the beacon URLs they hit), we’d have to make a patch to
the tagging logic and release and deploy a new refinery version to account
for the change.  Any time a new feature is added for which someone wants
interactions counted, we have to do the same.

Heck, if you use events, you could very easily consume and/or aggregate or
emit them to anywhere you wanted.  Your own datastore, a grafana dashboard,
a monitoring system, etc. etc. :)  It also will help us to standardize this
type of thing, so that in the future creation of new dashboards can be more
automated.
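
As an illustration of how little code such a consumer might need, here is a
minimal sketch that rolls individual events up into hourly counts, roughly
in the spirit of pageview_hourly. The event fields (`dt`, `project`,
`event_type`) and the function name are assumptions for this example, not
an existing schema or API.

```python
from collections import Counter
from datetime import datetime


def aggregate_hourly(events):
    """Count events per (project, event_type, hour), dropping exact timestamps."""
    counts = Counter()
    for e in events:
        # Assumed timestamp format: ISO 8601 in UTC, e.g. 2018-01-18T18:17:05Z
        ts = datetime.strptime(e["dt"], "%Y-%m-%dT%H:%M:%SZ")
        # Truncate to the start of the hour so only the hourly bucket is kept.
        hour = ts.replace(minute=0, second=0)
        counts[(e["project"], e["event_type"], hour.isoformat())] += 1
    return counts
```

The same function could feed a datastore, a dashboard, or a monitoring
check; only the place where the resulting counts are written would differ.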

On Thu, Jan 18, 2018 at 6:17 PM, Tilman Bayer <tbayer(a)wikimedia.org> wrote:

>
> On Thu, Jan 18, 2018 at 8:16 AM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>
>> Gergo,
>>
>> >while EventLogging data gets stored in a different, unrelated way
>> Not really. This has changed quite a bit over the last two quarters.
>> EventLogging data now gets preprocessed and refined similarly to how
>> webrequest data is preprocessed and refined. You can have a dashboard on
>> top of some eventlogging schemas on superset in the same way you have a
>> dashboard that displays pageview data on superset.
>>
>
> I don't see how this addresses Gergo's larger point about the difference
> between consistently tallying content consumption (pageviews, previews,
> mediaviewer image views) and analyzing UI interactions (which is the main
> use case that EventLogging has been developed and used for). There are
> really quite a few differences between these two. For example, UI
> instrumentations on the web are almost always sampled, because that yields
> enough data to answer UI questions - but on the other hand tend to record
> much more detail about the individual interaction. In contrast, we register
> all pageviews unsampled, but don't keep a permanent record of every single
> one of them with precise timestamps - rather, we have aggregated tables
> (pageview_hourly in particular). Our EventLogging backend is not tailored
> to that.
>
>
>
>>
>> See dashboards on superset (user required).
>>
>> https://superset.wikimedia.org/superset/dashboard/7/?preselect_filters=%7B%7D
>>
>> And (again, user required) EL data on druid, this very same data we are
>> talking about, page previews:
>>
>> https://pivot.wikimedia.org/#tbayer_popups
>>
>
> That's actually not the "very same data we are talking about". You can
> rest assured that the web team (and Sam in particular) has already been
> aware of the existence of the Popups instrumentation for page previews. The
> team spent considerable effort building it in order to understand how users
> interact with the feature's UI. Now comes the separate effort of
> systematically tallying content consumption from this new channel. Superset
> and Pivot are great, but are nowhere near providing all the ways that WMF
> analysts and community members currently have to study pageview data.
> ...
> Storing data about seen previews in the same way as we do for pageviews,
> for example in the pageview_hourly (suitably tagged, perhaps giving that
> table a more general name) would facilitate that a lot, by allowing us to
> largely reuse the work that during the past few years went into getting
> pageview aggregation right.
>
>
>>
>> >I was going to make the point that #2 already has a processing pipeline
>> established whereas #1 doesn't.
>> This is incorrect, we mark as "preview" data that we want to exclude
>> from processing, see:
>> https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L144
>> Naming is unfortunate, but previews are really "preloads", as in requests
>> we make (and cache locally) that may or may not be shown to users.
>>
>>
>> But again, tracking of events is better done on an event based system and
>> EL is such a system.
>>
>
> Again, tracking of individual events is not the ultimate goal here.
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
