generating aggregates from the page preview events, and then joining the page preview aggregates with the pageview aggregates into a new table with an extra dimension specifying which type of content view was made.
In my opinion the aggregated data should stay in two different tables. I can see a future where the preview data is of different types (it might include rich media that was/was not played, there are simple popups and "richer" ones ... whatever), and the dimensions along which you represent this consumption are not going to match pageview_hourly, which, again, only represents full page loads well.
On Tue, Jan 30, 2018 at 12:02 AM, Andrew Otto <otto(a)wikimedia.org> wrote:
CoOOOl :)
Using the GeoIP cookie will require reconfiguring the EventLogging varnishkafka instance [0]
I’m not familiar with this cookie, but, if we used it, I thought it would be sent back by the client in the event. E.g. event.country = response.headers.country; EventLogging.emit(event);
That way, there’s no additional special logic needed on the server side to
geocode or populate the country in the event.
However, if y’all can’t or don’t want to use the country cookie, then
yaaa, we gotta figure out what to do about IPs and geocoding in
EventLogging. There are a few options here, but none of them are great. The
options basically are variations on ‘treat this event schema as special and
make special conditionals in EventLogging processor code’, or, 'include IP
and/or geocode all events in all schemas'. We’re not sure which we want to
do yet, but we did mention this at our offsite today. I think we’ll figure
this out and make it happen in the next week or two. Whatever the
implementation ends up being, we’ll get geocoded data into this dataset.
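To make the 'geocode in the EventLogging server' option a little more concrete, here’s a rough Python sketch (illustrative only, not actual EventLogging processor code) of what the enrichment step might look like, assuming the geoip2 MaxMind client; the database path, function name, and event shape are all made up:

    import geoip2.database
    import geoip2.errors

    # Path to the MaxMind City database is illustrative.
    reader = geoip2.database.Reader('/usr/share/GeoIP/GeoIP2-City.mmdb')

    def geocode_event(event, client_ip):
        """Attach a country code to the event; the raw IP is never stored."""
        try:
            city = reader.city(client_ip)
            event['event']['country'] = city.country.iso_code
        except geoip2.errors.AddressNotFoundError:
            event['event']['country'] = 'Unknown'
        return event

The point being: however we decide to wire it in, the enrichment itself is a few lines around the same Maxmind databases we already use.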
Is the geocoding code that we use on webrequest_raw available as a Hive UDF or in PySpark?
The IP is geocoded from wmf_raw.webrequest to wmf.webrequest using a Hive
UDF
<https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-hive/src/main/java/org/wikimedia/analytics/refinery/hive/GetGeoDataUDF.java>
which ultimately just calls this getGeocodedData
<https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Geocode.java#L138>
function, which itself is just a wrapper around the Maxmind API. We may end
up doing geocoding in the EventLogging server codebase (again, really not
sure about this yet…), but if we do it will use the same Maxmind databases.
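And to answer the PySpark half of the question: Hive UDFs can be registered and called from Spark SQL, so something like the following should work. This is a hedged sketch; the JAR path and table name are illustrative, the class name is the one linked above, and I’m assuming the UDF returns a map of geocoded fields keyed by name:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName('geocode-events')
        .enableHiveSupport()
        .getOrCreate()
    )

    # Register the refinery Hive UDF; the JAR location is illustrative.
    spark.sql("""
        CREATE TEMPORARY FUNCTION get_geo_data
        AS 'org.wikimedia.analytics.refinery.hive.GetGeoDataUDF'
        USING JAR 'hdfs:///path/to/refinery-hive.jar'
    """)

    # Assuming the UDF returns a map of geocoded fields keyed by name.
    spark.sql("""
        SELECT get_geo_data(client_ip)['country_code'] AS country_code,
               COUNT(*) AS events
        FROM some_eventlogging_table  -- hypothetical
        GROUP BY get_geo_data(client_ip)['country_code']
    """).show()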
Aggregating the EventLogging data in the same way that we aggregate webrequest data into pageviews data will require either: replicating the process that does this and keeping the two processes in sync; or abstracting away the source table from the aggregation process so that it can work on both tables
I’m not totally sure if this works for you all, but I had pictured
generating aggregates from the page preview events, and then joining the
page preview aggregates with the pageview aggregates into a new table with
an extra dimension specifying which type of content view was made.
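Something like this, roughly (a PySpark sketch; the preview aggregate table and the column list are hypothetical stand-ins for whatever dimensions the two aggregates end up sharing):

    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Tag each aggregate with the type of content view it represents.
    pageviews = (
        spark.table('wmf.pageview_hourly')
        .withColumn('view_type', F.lit('pageview'))
    )
    previews = (
        spark.table('wmf.preview_hourly')  # hypothetical preview aggregate
        .withColumn('view_type', F.lit('page_preview'))
    )

    # Union on the shared dimensions; view_type is the extra dimension.
    shared = ['project', 'country_code', 'access_method', 'year', 'month',
              'day', 'hour', 'view_count', 'view_type']
    content_views = (
        pageviews.select(shared)
        .unionByName(previews.select(shared))
    )
    content_views.write.mode('overwrite').saveAsTable('wmf.content_view_hourly')

That keeps the two source aggregates separate (per Nuria’s point) while still giving reports a single table to query.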
I’d appreciate it if someone could estimate how much work it will be to implement GeoIP information and the other fields from Pageview hourly for EventLogging events
Ya we gotta figure this out still, but actual implementation shouldn’t be
difficult, however we decide to do it.
On Mon, Jan 29, 2018 at 10:30 PM, Sam Smith <samsmith(a)wikimedia.org>
wrote:
Hullo all,

It seems like we've arrived at an implementation for the client-side (JS) part of this problem: use EventLogging to track a page interaction from within the Page Previews code. This'll give us the flexibility to take advantage of a stream processing solution if/when it becomes available, to push the definition of a "Page Previews page interaction" to the client, and to rely on any events that we log in the immediate future ending up in tables that we're already familiar with.

In principle, I agree with Andrew's argument that adding additional filtering logic to the webrequest refinement process will make it harder to change existing definitions of views or add others in future. In practice though, we'll need to:

- Ensure that the server-side EventLogging component records metadata consistent with our existing content consumption measurement, concretely: the fields available in the
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly
table. In particular, that it either doesn't discard the client IP or utilizes the GeoIP cookie sent by the client for this schema.
- Aggregate the resulting table so that it can be combined with the pageviews table to generate reports.
- Ensure that the events aren't recorded in MySQL.

Using the GeoIP cookie will require reconfiguring the EventLogging varnishkafka instance [0], and raises questions about the compatibility with the corresponding field in the pageviews data. Retaining the client IP will require a similar change but will also require that we share the geocoding code with whatever process we use to refine the data that we’re capturing via EventLogging. Is the geocoding code that we use on webrequest_raw available as a Hive UDF or in PySpark?

Aggregating the EventLogging data in the same way that we aggregate webrequest data into pageviews data will require either: replicating the process that does this and keeping the two processes in sync; or abstracting away the source table from the aggregation process so that it can work on both tables. We’ll have to maintain the chosen approach until it’s superseded by a stream processing solution, the timeline of which is currently measured in years.

My next steps are to make sure that Audiences Product's requirements are all visible and to work with Tilman Bayer to create a schema that's suitable for our purposes but hopefully useful to others. Nuria has also offered to give a technical overview of EventLogging, which I think would be a great resource for everyone, so I'll look into setting up a meeting. I'd appreciate it if someone could estimate how much work it will be to implement GeoIP information and the other fields from Pageview hourly for EventLogging events on a per-schema basis.

Thanks,

-Sam

[0] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/manifests/cache/kafka/eventlogging.pp;52da8d06c760cd4e31b068d1a0392e3b3889033c$37
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics