generating aggregates from the page preview events, and then joining the page preview aggregates with the pageview aggregates into a new table with an extra dimension specifying which type of content view was made.
In my opinion the aggregated data should stay in two different tables. I can see a future where the preview data is of different types (it might include rich media that was/was not played, there are simple popups and "richer" ones ... whatever), and the dimensions along which you represent this consumption are not going to match pageview_hourly, which, again, only represents full page loads well.
On Tue, Jan 30, 2018 at 12:02 AM, Andrew Otto <otto(a)wikimedia.org> wrote:
CoOOOl :)
Using the GeoIP cookie will require reconfiguring the EventLogging varnishkafka instance [0]
I’m not familiar with this cookie, but, if we used it, I thought it would be sent back by the client in the event. E.g. event.country = response.headers.country; EventLogging.emit(event);
That way, there’s no additional special logic needed on the server side to
geocode or populate the country in the event.
However, if y’all can’t or don’t want to use the country cookie, then
yaaa, we gotta figure out what to do about IPs and geocoding in
EventLogging. There are a few options here, but none of them are great. The
options basically are variations on ‘treat this event schema as special and
make special conditionals in EventLogging processor code’, or, 'include IP
and/or geocode all events in all schemas'. We’re not sure which we want to
do yet, but we did mention this at our offsite today. I think we’ll figure
this out and make it happen in the next week or two. Whatever the
implementation ends up being, we’ll get geocoded data into this dataset.
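To make the 'geocode in the EventLogging server' option a little more concrete, here’s a rough Python sketch (illustrative only, not actual EventLogging processor code) of what the enrichment step might look like, assuming the geoip2 MaxMind client; the database path, function name, and event shape are all made up:

    import geoip2.database
    import geoip2.errors

    # Path to the MaxMind City database is illustrative.
    reader = geoip2.database.Reader('/usr/share/GeoIP/GeoIP2-City.mmdb')

    def geocode_event(event, client_ip):
        """Attach a country code to the event; the raw IP is never stored."""
        try:
            city = reader.city(client_ip)
            event['event']['country'] = city.country.iso_code
        except geoip2.errors.AddressNotFoundError:
            event['event']['country'] = 'Unknown'
        return event

The point being: however we decide to wire it in, the enrichment itself is a few lines around the same Maxmind databases we already use.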
Is the geocoding code that we use on webrequest_raw available as a Hive UDF or in PySpark?
The IP is geocoded from wmf_raw.webrequest to wmf.webrequest using a Hive
UDF
<https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-hive/src/main/java/org/wikimedia/analytics/refinery/hive/GetGeoDataUDF.java>
which ultimately just calls this getGeocodedData
<https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Geocode.java#L138>
function, which itself is just a wrapper around the Maxmind API. We may end
up doing geocoding in the EventLogging server codebase (again, really not
sure about this yet…), but if we do it will use the same Maxmind databases.
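And to answer the PySpark half of the question: Hive UDFs can be registered and called from Spark SQL, so something like the following should work. This is a hedged sketch; the JAR path and table name are illustrative, the class name is the one linked above, and I’m assuming the UDF returns a map of geocoded fields keyed by name:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName('geocode-events')
        .enableHiveSupport()
        .getOrCreate()
    )

    # Register the refinery Hive UDF; the JAR location is illustrative.
    spark.sql("""
        CREATE TEMPORARY FUNCTION get_geo_data
        AS 'org.wikimedia.analytics.refinery.hive.GetGeoDataUDF'
        USING JAR 'hdfs:///path/to/refinery-hive.jar'
    """)

    # Assuming the UDF returns a map of geocoded fields keyed by name.
    spark.sql("""
        SELECT get_geo_data(client_ip)['country_code'] AS country_code,
               COUNT(*) AS events
        FROM some_eventlogging_table  -- hypothetical
        GROUP BY get_geo_data(client_ip)['country_code']
    """).show()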
Aggregating the EventLogging data in the same way that we aggregate webrequest data into pageviews data will require either: replicating the process that does this and keeping the two processes in sync; or abstracting away the source table from the aggregation process so that it can work on both tables
I’m not totally sure if this works for you all, but I had pictured
generating aggregates from the page preview events, and then joining the
page preview aggregates with the pageview aggregates into a new table with
an extra dimension specifying which type of content view was made.
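Something like this, roughly (a PySpark sketch; the preview aggregate table and the column list are hypothetical stand-ins for whatever dimensions the two aggregates end up sharing):

    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Tag each aggregate with the type of content view it represents.
    pageviews = (
        spark.table('wmf.pageview_hourly')
        .withColumn('view_type', F.lit('pageview'))
    )
    previews = (
        spark.table('wmf.preview_hourly')  # hypothetical preview aggregate
        .withColumn('view_type', F.lit('page_preview'))
    )

    # Union on the shared dimensions; view_type is the extra dimension.
    shared = ['project', 'country_code', 'access_method', 'year', 'month',
              'day', 'hour', 'view_count', 'view_type']
    content_views = (
        pageviews.select(shared)
        .unionByName(previews.select(shared))
    )
    content_views.write.mode('overwrite').saveAsTable('wmf.content_view_hourly')

That keeps the two source aggregates separate (per Nuria’s point) while still giving reports a single table to query.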
I’d appreciate it if someone could estimate how much work it will be to implement GeoIP information and the other fields from Pageview hourly for EventLogging events
Ya we gotta figure this out still, but actual implementation shouldn’t be
difficult, however we decide to do it.
On Mon, Jan 29, 2018 at 10:30 PM, Sam Smith <samsmith(a)wikimedia.org>
wrote:
Hullo all,

It seems like we've arrived at an implementation for the client-side (JS) part of this problem: use EventLogging to track a page interaction from within the Page Previews code. This'll give us the flexibility to take advantage of a stream processing solution if/when it becomes available, to push the definition of a "Page Previews page interaction" to the client, and to rely on any events that we log in the immediate future ending up in tables that we're already familiar with.

In principle, I agree with Andrew's argument that adding additional filtering logic to the webrequest refinement process will make it harder to change existing definitions of views or add others in future. In practice though, we'll need to:

- Ensure that the server-side EventLogging component records metadata consistent with our existing content consumption measurement, concretely: the fields available in the
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly
table. In particular, that it either doesn't discard the client IP or utilizes the GeoIP cookie sent by the client for this schema.
- Aggregate the resulting table so that it can be combined with the pageviews table to generate reports.
- Ensure that the events aren't recorded in MySQL.

Using the GeoIP cookie will require reconfiguring the EventLogging varnishkafka instance [0], and raises questions about the compatibility with the corresponding field in the pageviews data. Retaining the client IP will require a similar change but will also require that we share the geocoding code with whatever process we use to refine the data that we’re capturing via EventLogging. Is the geocoding code that we use on webrequest_raw available as a Hive UDF or in PySpark?

Aggregating the EventLogging data in the same way that we aggregate webrequest data into pageviews data will require either: replicating the process that does this and keeping the two processes in sync; or abstracting away the source table from the aggregation process so that it can work on both tables. We’ll have to maintain the chosen approach until it’s superseded by a stream processing solution, the timeline of which is currently measured in years.

My next steps are to make sure that Audiences Product's requirements are all visible and to work with Tilman Bayer to create a schema that's suitable for our purposes but hopefully useful to others. Nuria has also offered to give a technical overview of EventLogging, which I think would be a great resource for everyone, so I'll look into setting up a meeting. I'd appreciate it if someone could estimate how much work it will be to implement GeoIP information and the other fields from Pageview hourly for EventLogging events on a per-schema basis.

Thanks,

-Sam

[0] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/manifests/cache/kafka/eventlogging.pp;52da8d06c760cd4e31b068d1a0392e3b3889033c$37
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics