Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

19 Jan 2018

On Thu, Jan 18, 2018 at 8:16 AM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:

...
  Gergo,

  while EventLogging data gets stored in a different, unrelated way

 Not really, This has changed quite a bit as of the last two quarters.
 Eventlogging data as of recent gets preprocessed and refined similar to how
 webrequest data is preprocessed and refined. You can have a dashboard on
 top of some eventlogging schemas on superset in the same way you have a
 dashboard that displays pageview data on superset.

I don't see how this addresses Gergo's larger point about the difference
between consistently tallying content consumption (pageviews, previews,
mediaviewer image views) and analyzing UI interactions (which is the main
use case that EventLogging has been developed and used for). There are
really quite a few differences between these two. For example, UI
instrumentations on the web are almost always sampled, because that yields
enough data to answer UI questions - but on the other hand tend to record
much more detail about the individual interaction. In contrast, we register
all pageviews unsampled, but don't keep a permanent record of every single
one of them with precise timestamps - rather, we have aggregated tables
(pageview_hourly in particular). Our EventLogging backend is not tailored
to that.

...

 See dashboards on superset (user required).

 https://superset.wikimedia.org/superset/dashboard/7/?preselect_filters=%7B%7D

 And (again, user required) EL data on druid, this very same data we are
 talking about, page previews:

 https://pivot.wikimedia.org/#tbayer_popups

That's actually not the "very same data we are talking about". You can rest
assured that the web team (and Sam in particular) is already well aware of
the existence of the Popups instrumentation for page previews. The team
spent considerable effort building it in order to understand how users
interact with the feature's UI. Now comes the separate effort of
systematically tallying content consumption from this new channel. Superset
and Pivot are great, but are nowhere near providing all the ways that WMF
analysts and community members currently have to study pageview data.
Storing data about seen previews in the same way as we do for pageviews,
for example in the pageview_hourly table (suitably tagged, and perhaps
giving that table a more general name), would facilitate that a lot, by
allowing us to largely reuse the work that went into getting pageview
aggregation right over the past few years.
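
To make the idea of a tagged aggregate concrete, here is a minimal sketch
in Python. It is not the actual refinery job, and the field names
("timestamp", "page", "kind") and the "kind" tag are assumptions made up
for illustration, not the real pageview_hourly schema. It only shows that
individual view events can be collapsed into hourly counts, with previews
and pageviews kept apart by a tag.

```python
from collections import Counter
from datetime import datetime


def aggregate_hourly(events):
    """Collapse individual view events into hourly counts.

    Each event is a dict with hypothetical keys "timestamp" (ISO 8601),
    "page" and "kind" ("pageview" or "preview"). Precise timestamps are
    discarded; only the hour bucket survives.
    """
    counts = Counter()
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Truncate to the start of the hour to form the bucket key.
        hour = ts.replace(minute=0, second=0, microsecond=0).isoformat()
        counts[(hour, event["page"], event["kind"])] += 1
    return counts


# Made-up sample events, not real traffic data.
events = [
    {"timestamp": "2018-01-18T08:16:05", "page": "Cat", "kind": "pageview"},
    {"timestamp": "2018-01-18T08:47:30", "page": "Cat", "kind": "pageview"},
    {"timestamp": "2018-01-18T08:20:11", "page": "Cat", "kind": "preview"},
    {"timestamp": "2018-01-18T09:02:00", "page": "Cat", "kind": "preview"},
]

hourly = aggregate_hourly(events)
for (hour, page, kind), n in sorted(hourly.items()):
    print(hour, page, kind, n)
```

With a layout along these lines, existing queries over pageviews would
only need an extra filter on the tag, which is the kind of reuse meant
above.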

...

  I was going to make the point that #2 already has a processing pipeline
  established whereas #1 doesn't.

 This is incorrect, we mark as "preview" data that we want to exclude from
 processing, see:
 https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L144
 Naming is unfortunate but previews are really "preloads" as in requests we
 make (and cache locally) and maybe shown to users or not.

 But again, tracking of events is better done on an event based system and
 EL is such a system.

Again, tracking of individual events is not the ultimate goal here.

-- 
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
