> I’m not totally sure if this works for you all, but I had pictured generating aggregates from the page preview events, and then joining the page preview aggregates with the pageview aggregates into a new table with an extra dimension specifying which type of content view was made.

In my opinion, the aggregated data should stay in two different tables. I can see a future where the preview data is of different types (it might include rich media that was/was not played, there are simple popups and "richer" ones ... whatever) and the dimensions in which you represent this consumption are not going to match pageview_hourly, which, again, only represents full page loads.

On Tue, Jan 30, 2018 at 12:02 AM, Andrew Otto <otto@wikimedia.org> wrote:
CoOOOl :)

> Using the GeoIP cookie will require reconfiguring the EventLogging varnishkafka instance [0]

I’m not familiar with this cookie, but, if we used it, I thought it would be sent back by the client in the event. E.g. event.country = response.headers.country; EventLogging.emit(event);

That way, there’s no additional special logic needed on the server side to geocode or populate the country in the event.
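
If it helps to picture it: assuming the cookie value looks something like 'US:CO:Denver:39.74:-104.98:v4' (country first, colon-separated; verify the real format before relying on this), pulling the country out is trivial wherever you do it. Rough sketch:

    def country_from_geoip_cookie(value):
        # ASSUMED format: country:region:city:lat:lon:version.
        # Check against the actual GeoIP cookie before using!
        parts = value.split(':')
        return parts[0] if parts and parts[0] else None

    # country_from_geoip_cookie('US:CO:Denver:39.74:-104.98:v4') -> 'US'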

However, if y’all can’t or don’t want to use the country cookie, then yaaa, we gotta figure out what to do about IPs and geocoding in EventLogging. There are a few options here, but none of them are great. The options are basically variations on 'treat this event schema as special and make special conditionals in EventLogging processor code', or, 'include IP and/or geocode all events in all schemas'. We’re not sure which we want to do yet, but we did mention this at our offsite today. I think we’ll figure this out and make it happen in the next week or two. Whatever the implementation ends up being, we’ll get geocoded data into this dataset.
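
Just to make the 'special conditionals' option concrete, here's a very rough sketch of what the processor side could look like, using MaxMind's geoip2 Python library (the schema name, event shape, and database path are all made up, not decided):

    import geoip2.database
    import geoip2.errors

    # Hypothetical: schemas that have opted in to geocoding.
    GEOCODED_SCHEMAS = {'VirtualPageView'}

    reader = geoip2.database.Reader('/usr/share/GeoIP/GeoIP2-Country.mmdb')

    def process_event(event):
        # Geocode, then drop the raw IP so it never reaches storage.
        if event.get('schema') in GEOCODED_SCHEMAS and event.get('ip'):
            try:
                event['event']['country'] = reader.country(event['ip']).country.iso_code
            except geoip2.errors.AddressNotFoundError:
                event['event']['country'] = 'Unknown'
            del event['ip']
        return event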

> Is the geocoding code that we use on webrequest_raw available as a Hive UDF or in PySpark?
The IP is geocoded from wmf_raw.webrequest to wmf.webrequest using a Hive UDF, which ultimately just calls this getGeocodedData function, itself a thin wrapper around the Maxmind API. We may end up doing geocoding in the EventLogging server codebase (again, really not sure about this yet…), but if we do, it will use the same Maxmind databases.
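
If you’d rather do it in PySpark, wrapping the same Maxmind databases yourself is only a few lines. Totally untested sketch (path and column names illustrative):

    import geoip2.database
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def country_code(ip, mmdb_path='/usr/share/GeoIP/GeoIP2-Country.mmdb'):
        # Opening a Reader per call is slow; a real job would share one
        # Reader per executor.  Minimal version for clarity.
        try:
            with geoip2.database.Reader(mmdb_path) as reader:
                return reader.country(ip).country.iso_code
        except Exception:
            return None

    country_udf = udf(country_code, StringType())
    # df = df.withColumn('country_code', country_udf(df['client_ip']))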


> Aggregating the EventLogging data in the same way that we aggregate webrequest data into pageviews data will require either: replicating the process that does this and keeping the two processes in sync; or abstracting away the source table from the aggregation process so that it can work on both tables

I’m not totally sure if this works for you all, but I had pictured generating aggregates from the page preview events, and then joining the page preview aggregates with the pageview aggregates into a new table with an extra dimension specifying which type of content view was made.  
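
Very roughly, something like this is what I’m picturing (the preview table and all column names here are placeholders; pageview_hourly’s actual dimensions would drive the real thing):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    preview_agg = spark.sql("""
        SELECT project, country_code, year, month, day, hour,
               'preview' AS view_type, COUNT(*) AS view_count
        FROM event.virtual_pageview   -- placeholder table name
        GROUP BY project, country_code, year, month, day, hour
    """)

    pageview_agg = spark.sql("""
        SELECT project, country_code, year, month, day, hour,
               'pageview' AS view_type, SUM(view_count) AS view_count
        FROM wmf.pageview_hourly
        GROUP BY project, country_code, year, month, day, hour
    """)

    # The combined table carries one extra dimension, view_type,
    # distinguishing previews from full pageviews.
    combined = preview_agg.union(pageview_agg)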


> I’d appreciate it if someone could estimate how much work it will be to implement GeoIP information and the other fields from Pageview hourly for EventLogging events

Ya, we still gotta figure this out, but the actual implementation shouldn’t be difficult, however we decide to do it.

On Mon, Jan 29, 2018 at 10:30 PM, Sam Smith <samsmith@wikimedia.org> wrote:

Hullo all,


It seems like we've arrived at an implementation for the client-side (JS) part of this problem: use EventLogging to track a page interaction from within the Page Previews code. This'll give us the flexibility to take advantage of a stream processing solution if/when it becomes available, to push the definition of a "Page Previews page interaction" to the client, and to rely on any events that we log in the immediate future ending up in tables that we're already familiar with.


In principle, I agree with Andrew's argument that adding additional filtering logic to the webrequest refinement process will make it harder to change existing definitions of views or add others in future. In practice though, we'll need to:


  • Ensure that the server-side EventLogging component records metadata consistent with our existing content consumption measurement, concretely: the fields available in the https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly table. In particular, it should either not discard the client IP or utilize the GeoIP cookie sent by the client for this schema.

  • Aggregate the resulting table so that it can be combined with the pageviews table to generate reports.

  • Ensure that the events aren't recorded in MySQL.


Using the GeoIP cookie will require reconfiguring the EventLogging varnishkafka instance [0], and raises questions about its compatibility with the corresponding field in the pageviews data. Retaining the client IP will require a similar change, but will also require that we share the geocoding code with whatever process we use to refine the data that we’re capturing via EventLogging. Is the geocoding code that we use on webrequest_raw available as a Hive UDF or in PySpark?


Aggregating the EventLogging data in the same way that we aggregate webrequest data into pageviews data will require either: replicating the process that does this and keeping the two processes in sync; or abstracting away the source table from the aggregation process so that it can work on both tables. We’ll have to maintain the chosen approach until it’s superseded by a stream processing solution, the timeline of which is currently measured in years.
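
For the latter option, I imagine the abstraction could be as simple as parameterizing the source table and the label that ends up in the output, e.g. (rough sketch; table and column names are illustrative, and the real webrequest aggregation would need its usual filtering):

    def aggregate_views(spark, source_table, view_type):
        # Identical aggregation logic for both sources; only the table
        # name and the view_type label differ.  A WHERE clause (e.g.
        # is_pageview for webrequest) would be added per source.
        return spark.sql("""
            SELECT project, country_code, year, month, day, hour,
                   '{label}' AS view_type, COUNT(*) AS view_count
            FROM {table}
            GROUP BY project, country_code, year, month, day, hour
        """.format(label=view_type, table=source_table))

    # pageviews = aggregate_views(spark, 'wmf.webrequest', 'pageview')
    # previews  = aggregate_views(spark, 'event.virtual_pageview', 'preview')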


My next steps are making sure that Audiences Product's requirements are all visible and working with Tilman Bayer to create a schema that's suitable for our purposes but hopefully useful to others. Nuria has also offered to give a technical overview of EventLogging, which I think would be a great resource for everyone, so I'll look into setting up a meeting. I'd appreciate it if someone could estimate how much work it will be to implement GeoIP information and the other fields from Pageview hourly for EventLogging events on a per-schema basis.


Thanks,


-Sam


[0] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/manifests/cache/kafka/eventlogging.pp;52da8d06c760cd4e31b068d1a0392e3b3889033c$37



_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


