Hullo all,


It seems like we've arrived at an implementation for the client-side (JS) part of this problem: use EventLogging to track a page interaction from within the Page Previews code. This'll give us the flexibility to take advantage of a stream processing solution if/when one becomes available, to push the definition of a "Page Previews page interaction" to the client, and to ensure that any events we log in the immediate future end up in tables that we're already familiar with.


In principle, I agree with Andrew's argument that adding additional filtering logic to the webrequest refinement process will make it harder to change existing definitions of views or add others in future. In practice though, we'll need to:


  • Ensure that the server-side EventLogging component records metadata consistent with our existing content consumption measurement, concretely: the fields available in the https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly table. In particular, that it either doesn't discard the client IP or utilizes the GeoIP cookie sent by the client for this schema.

  • Aggregate the resulting table so that it can be combined with the pageviews table to generate reports (see the sketch after this list).

  • Ensure that the events aren't recorded in MySQL.
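
To make the first two points concrete, here's a minimal PySpark sketch of the kind of aggregation and combination I have in mind. Only wmf.pageview_hourly is a real table; the EventLogging-derived table and its columns are placeholders, and carrying these shared dimensions is exactly why the server-side component needs to record Pageview_hourly-style metadata:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("previews-report").getOrCreate()

    # Shared dimensions that both tables would need to carry.
    dims = ["project", "country_code", "year", "month", "day", "hour"]

    # Hypothetical EventLogging-derived table; one row per interaction.
    previews = (
        spark.table("event.page_previews_interaction")  # placeholder name
        .groupBy(*dims)
        .agg(F.count(F.lit(1)).alias("view_count"))
        .withColumn("source", F.lit("page_previews"))
    )

    # wmf.pageview_hourly is already aggregated, so we just sum its counts.
    pageviews = (
        spark.table("wmf.pageview_hourly")
        .groupBy(*dims)
        .agg(F.sum("view_count").alias("view_count"))
        .withColumn("source", F.lit("pageview"))
    )

    report = previews.unionByName(pageviews)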


Using the GeoIP cookie will require reconfiguring the EventLogging varnishkafka instance [0], and raises questions about compatibility with the corresponding field in the pageviews data. Retaining the client IP will require a similar change, but will also require that we share the geocoding code with whatever process we use to refine the data that we're capturing via EventLogging. Is the geocoding code that we use on webrequest_raw available as a Hive UDF or in PySpark?
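
If the answer is "no", here's a rough sketch of what reimplementing the geocoding as a PySpark UDF around MaxMind's geoip2 library might look like. The database path and table/column names are assumptions, and keeping this in sync with the refinery's geocoding (database version, field semantics) is exactly the maintenance cost I'd like to avoid:

    import geoip2.database  # MaxMind's GeoIP2 Python API
    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.appName("el-geocode").getOrCreate()

    MMDB_PATH = "/usr/share/GeoIP/GeoLite2-City.mmdb"  # path is an assumption

    def geocode_country(ip):
        # Open the reader lazily on each executor rather than shipping it
        # from the driver, since the reader isn't picklable.
        if not hasattr(geocode_country, "reader"):
            geocode_country.reader = geoip2.database.Reader(MMDB_PATH)
        try:
            return geocode_country.reader.city(ip).country.iso_code
        except Exception:
            return None

    geocode_country_udf = F.udf(geocode_country, T.StringType())

    # Placeholder names: a refined EventLogging table that still carries
    # the client IP.
    events = spark.table("event.page_previews_interaction")
    events = events.withColumn("country_code", geocode_country_udf("client_ip"))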


Aggregating the EventLogging data in the same way that we aggregate webrequest data into pageviews data will require either replicating the process that does this and keeping the two processes in sync, or abstracting the source table away from the aggregation process so that it can work on both tables. We'll have to maintain the chosen approach until it's superseded by a stream processing solution, the timeline for which is currently measured in years.
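
To sketch the second option under some assumed column names: the aggregation becomes a function of the source table, an optional filter that encodes the view definition, and a mapping from the table's columns onto the shared dimensions. The EventLogging names below are placeholders:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("hourly-aggregation").getOrCreate()

    DIMS = ["project", "country_code", "year", "month", "day", "hour"]

    def aggregate_hourly(table, column_map=None, filter_expr=None):
        """Aggregate any source table to hourly counts over the shared
        dimensions. The view definition lives in filter_expr, not here."""
        df = spark.table(table)
        if filter_expr is not None:
            df = df.where(filter_expr)
        for target, source in (column_map or {}).items():
            df = df.withColumn(target, F.col(source))
        return df.groupBy(*DIMS).agg(F.count(F.lit(1)).alias("view_count"))

    # The webrequest side keeps its existing view definition...
    webrequests = aggregate_hourly(
        "wmf.webrequest",
        column_map={
            "project": "normalized_host.project",
            "country_code": "geocoded_data.country_code",
        },
        filter_expr=F.col("is_pageview"),
    )

    # ...while the EventLogging side (placeholder names) supplies its own.
    previews = aggregate_hourly(
        "event.page_previews_interaction",
        column_map={"country_code": "event.countryCode"},
    )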


My next steps are to make sure that Audiences Product's requirements are all visible and to work with Tilman Bayer to create a schema that's suitable for our purposes but hopefully useful to others. Nuria has also offered to give a technical overview of EventLogging, which I think would be a great resource for everyone, so I'll look into setting up a meeting. I'd appreciate it if someone could estimate how much work it would be to add GeoIP information and the other Pageview_hourly fields to EventLogging events on a per-schema basis.
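
To seed the schema discussion, here's a strawman of the kinds of fields it might need, written as the Python equivalent of an EventLogging JSON Schema. Every field name here is a placeholder for Tilman and me to work over, not a proposal:

    # Strawman only: placeholder field names for discussion. The second group
    # mirrors Pageview_hourly dimensions that we'd otherwise need the server
    # side to derive (from the client IP or the GeoIP cookie).
    page_previews_interaction = {
        "description": "A Page Previews page interaction (placeholder).",
        "properties": {
            "pageTitle":    {"type": "string",  "required": True},
            "namespaceId":  {"type": "integer", "required": True},
            "countryCode":  {"type": "string"},
            "accessMethod": {"type": "string", "enum": ["desktop", "mobile web"]},
        },
    }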


Thanks,


-Sam


[0] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/manifests/cache/kafka/eventlogging.pp;52da8d06c760cd4e31b068d1a0392e3b3889033c$37