Hullo,
Page Previews is now fully deployed to all but 2 of the Wikipedias. In deploying it, we've created a new way to interact with pages without navigating to them. This impacts the overall and per-page pageview metrics that are used in myriad reports, e.g. reports to editors about the readership of their articles and monthly reports to the board. Consequently, we need to be able to report a user reading the preview of a page just as we report them navigating to it.
Readers Web are planning to instrument Page Previews such that when a preview is available and open for longer than X ms, a "page interaction" is recorded. We're aware of a couple of mechanisms for recording something like this from the client:
1. All files viewed with the media viewer are recorded by the client requesting the /beacon/media?duration=X&uri=Y URL at some point [0] – as Nuria points out in that thread, requests to /beacon/... are already filtered and a canned response is sent immediately by Varnish [1].
2. Requesting a URL with the X-Analytics header [2] set to "preview". In this context, we'd make a HEAD request to the URL of the page with the header set.
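For concreteness, a rough client-side sketch of what the two mechanisms above could look like follows; the /beacon/preview path, the exact X-Analytics key/value format, and the threshold value are illustrative assumptions rather than settled details.

var PREVIEW_SEEN_MS = 1000; // the "X ms" under discussion

// Mechanism 1: request a /beacon/... URL that Varnish answers immediately,
// analogous to the media viewer's /beacon/media?duration=X&uri=Y.
function logPreviewSeenViaBeacon( pageUrl, durationMs ) {
    var beaconUrl = '/beacon/preview?duration=' + durationMs +
        '&uri=' + encodeURIComponent( pageUrl );
    if ( navigator.sendBeacon ) {
        navigator.sendBeacon( beaconUrl );
    } else {
        new Image().src = beaconUrl; // simple GET fallback
    }
}

// Mechanism 2: a HEAD request to the previewed page itself, tagged via X-Analytics
// so that it can be picked out of the webrequest stream.
function logPreviewSeenViaHead( pageUrl ) {
    fetch( pageUrl, {
        method: 'HEAD',
        headers: { 'X-Analytics': 'preview=1' } // exact key/value format is an assumption
    } );
}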
IMO #1 is preferable from the operations and performance perspectives as the response is always served from the edge and includes very few headers, whereas the request in #2 may be served by the application servers if the user is logged in (or in the mobile site's beta cohort). However, the requests in #2 are already
We're currently considering recording page interactions when previews are open for longer than 1000 ms. We estimate that this would increase overall web requests by 0.3% [3].
Are there other ways of recording this information? We're fairly confident that #1 seems like the best choice here but it's referred to as the "virtual file view hack". Is this really the case? Moreover, should we request a distinct URL, e.g. /beacon/preview?duration=X&uri=Y, or should we consolidate the URLs as both represent the same thing essentially?
Thanks,
-Sam
Timezone: GMT
IRC (Freenode): phuedx
[0] https://lists.wikimedia.org/pipermail/analytics/2015-March/003633.html
[1] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb;1bce79d58e03bd02888beef986c41989e8345037$269
[2] https://wikitech.wikimedia.org/wiki/X-Analytics
[3] https://phabricator.wikimedia.org/T184793#3901365
Hi Sam,
On Wed, Jan 17, 2018 at 1:51 AM, Sam Smith samsmith@wikimedia.org wrote:
IMO #1 is preferable from the operations and performance perspectives as the response is always served from the edge and includes very few headers, whereas the request in #2 may be served by the application servers if the user is logged in (or in the mobile site's beta cohort). However, the requests in #2 are already
It seems the sentence above is cut, can you resend it?
We're currently considering recording page interactions when previews are open for longer than 1000 ms. We estimate that this would increase overall web requests by 0.3% [3].
Can you say some words about how the 1000 ms threshold is chosen? Is this based (partially) on looking at traces where a user-agent goes to a page and returns to the "source" article?
Thanks, Leila
[0] https://lists.wikimedia.org/pipermail/analytics/2015-March/003633.html
[1] https://phabricator.wikimedia.org/source/operations-puppet/browse/production...
[2] https://wikitech.wikimedia.org/wiki/X-Analytics
[3] https://phabricator.wikimedia.org/T184793#3901365
On Wed, Jan 17, 2018 at 6:46 PM, Leila Zia leila@wikimedia.org wrote:
On Wed, Jan 17, 2018 at 1:51 AM, Sam Smith samsmith@wikimedia.org wrote:
IMO #1 is preferable from the operations and performance perspectives as the response is always served from the edge and includes very few headers, whereas the request in #2 may be served by the application servers if the user is logged in (or in the mobile site's beta cohort). However, the requests in #2 are already
It seems the sentence above is cut, can you resend it?
Hah! I should've cut the whole sentence.
I was going to make the point that #2 already has a processing pipeline established whereas #1 doesn't. AIUI, there'd have to be a refinement step added to Oozie to process the requests in #1, whereas the requests in #2 already make it into the webrequest table with the appropriate value in the x_analytics column.
Thanks,
-Sam
Hi Leila,
On Wed, Jan 17, 2018 at 10:46 AM, Leila Zia leila@wikimedia.org wrote:
Hi Sam,
On Wed, Jan 17, 2018 at 1:51 AM, Sam Smith samsmith@wikimedia.org wrote:
IMO #1 is preferable from the operations and performance perspectives as the response is always served from the edge and includes very few headers, whereas the request in #2 may be served by the application servers if the user is logged in (or in the mobile site's beta cohort). However, the requests in #2 are already
It seems the sentence above is cut, can you resend it?
We're currently considering recording page interactions when previews are open for longer than 1000 ms. We estimate that this would increase overall web requests by 0.3% [3].
Can you say some words about how the 1000 ms threshold is chosen?
This is a good question, sorry that it got buried earlier. (It's kind of orthogonal though to the technical instrumentation questions that have been the focus of attention: as indicated by the capital X in Sam's initial post, we can still decide to fine-tune that threshold right now, it's just a parameter change.)
This kind of threshold necessarily needs to be set somewhat arbitrarily, in the sense that there will always be either cases where some content was already read/perceived in a preview card shown for a shorter time, or cases where a reader needed a longer time to consume any content from the card. We picked a time by which we can be reasonably certain that at least some readers can consume content (read some words, perceive an image). It's not the result of an exact calculation to find the provably best limit. But we did have a look at the frequency of the different user actions over time during the first seconds after they start to hover over a link. In case you're interested, I recently updated those charts with better quality data from our latest two tests, e.g. https://phabricator.wikimedia.org/F12940888 and https://phabricator.wikimedia.org/F13134460 (a zoomed-in look at the same histogram).
The following is just eyeballing and thinking aloud, but one way to view this histogram is as the sum of several distributions associated with different user intentions:
1. Most of the time when our instrumentation registered the cursor moving over a link, the user was just on their way to a different part of the screen (with no intention of either clicking that link or viewing the preview). That's mostly the huge yellow spike on the left - "dwelledButAbandoned" meaning that the cursor left the link without either clicking it or causing a preview to show. The feature involves a 500ms delay before the preview card begins to display, so that we don't bother that group too much. (Only the right tail end of that distribution, folks moving the cursor very slowly, will be affected, where things morph from yellow into purple.)
2. Then there are users who want to click the link without viewing the preview, forming all of the green part left of 500ms and an unknown portion to the right of it (after the card starts to show, some of these "open" actions will instead happen after the user intentionally viewed the card, case 3).
3. And there are users who intentionally view a preview. The little bump in the purple part ("dismissed" meaning that the preview was shown and then closed by moving the cursor away) at about 1100ms indicates that the distribution for that user group also peaks somewhere there, maybe a few hundred ms to the right. That would mean that our 1000ms threshold (i.e. only counting the part of the histogram right of 1500ms = 500ms + 1000ms as seen previews) is actually right of that distribution's peak, i.e. that the threshold is in some sense quite conservative.
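To make the arithmetic in point 3 concrete, here is a tiny sketch of how the instrumentation could decide whether a given hover counts as a seen preview; the function name is made up, and the 500 ms delay and 1000 ms threshold are the tunable parameters discussed above.

var RENDER_DELAY_MS = 500;    // delay before the preview card starts to display
var SEEN_THRESHOLD_MS = 1000; // how long the card must then stay open to count

// hoverMs: total time the cursor dwelt on the link, measured from hover start.
function isCountedAsSeenPreview( hoverMs ) {
    // Corresponds to the "right of 1500 ms" region of the histogram:
    // 500 ms until the card appears, plus 1000 ms with the card open.
    return hoverMs >= RENDER_DELAY_MS + SEEN_THRESHOLD_MS;
}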
Like I said, this is all of course still a bit handwavy; it involves some assumptions about the form of these distributions, as well as disregarding some other information for now that can give a fuller picture (in particular the analogous histogram for link interaction behavior without page previews being active, which we also have from our A/B tests).
Is this based (partially) on looking at traces where a user-agent goes to a page and returns to the "source" article?
We did an analysis of that user behavior, but not regarding the timing question; rather, it was about finding out how much of the reduction in pageviews comes from reduced usage of the back button. I'm not sure how directly we can compare the action of loading an entire new page and then going back (two clicks that also involve moving the mouse cursor to an entirely different part of the screen - the back button - in between) with the action of hovering over a link and then moving the cursor away a small distance to close the preview; it seems to me that the latter involves much less friction - which is kind of the whole point of the previews feature ;)
As indicated, we already picked a value for the threshold that we are quite comfortable with. But if you are still interested in this question and have some spare time, I'm more than happy to chat about it further off-list.
Thank you, Tilman. This is very helpful.
Leila
(Moving ops list to bcc)
Are there other ways of recording this information? We're fairly confident that #1 seems like the best choice here but it's referred to as the "virtual file view hack". Is this really the case?
Yes, there are; please use EventLogging.
Recording "preview_events" is really no different that recording any other kind of UI event, difference is going to come from scale if anything, as they are probably tens of thousands of those per second (I think your team already estimated volume, if so please send those estimates along)
We discourage you from sending events directly to the beacon. Rather, use the EL client to send a page-preview event defined in a given schema. This is similar to how we will be measuring banner impressions for fundraising banners in the future.
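As a rough sketch of what that could look like from the Page Previews code (the schema name and fields below are hypothetical; the real schema would be defined and reviewed separately):

// Hypothetical schema and fields, for illustration only.
function logVirtualPageView( previewedTitle, openMs ) {
    mw.loader.using( 'ext.eventLogging' ).then( function () {
        mw.eventLog.logEvent( 'VirtualPageView', {
            source_title: mw.config.get( 'wgPageName' ), // page the reader is on
            page_title: previewedTitle,                  // page shown in the preview card
            duration: openMs                             // how long the card was open, in ms
        } );
    } );
}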
Thanks,
Nuria
On Wed, Jan 17, 2018 at 10:54 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Recording "preview_events" is really no different that recording any other kind of UI event, difference is going to come from scale if anything, as they are probably tens of thousands of those per second (I think your team already estimated volume, if so please send those estimates along)
Conceptually I think a virtual pageview is a different thing from a UI event (which is how e.g. Google Analytics handles it: there is a method to send an event for the current page and a different method to send a virtual pageview for a different page), and the ideal way it is exposed in an analytics system should be very different. (I would want to see virtual pageviews together with normal pageviews, with some filtering option. If I deploy code that shows previews and converts users from making real pageviews to making virtual pageviews, I want to see how the total pageviews changed in the normal pageview stats; I don't want to have to create that chart and export one dataset from pageviews and one dataset from eventlogging to do that. As a user, I want to see in the fileview API how many people looked at the photo I uploaded; I don't particularly care if they used MediaViewer or not, etc.)
So maybe it's worth considering which approach takes us closer to that? AIUI the beacon puts the record into the webrequest table, and from there it would only take some trivial preprocessing to replace the beacon URL with the virtual URL and add the beacon type as a "virtual_type" field or something, making it very easy to expose it everywhere views are tracked, while EventLogging data gets stored in a different, unrelated way.
Gergo,
while EventLogging data gets stored in a different, unrelated way
Not really. This has changed quite a bit over the last two quarters. EventLogging data now gets preprocessed and refined similarly to how webrequest data is preprocessed and refined. You can have a dashboard on top of some EventLogging schemas in Superset in the same way you have a dashboard that displays pageview data in Superset.
See dashboards on Superset (account required):
https://superset.wikimedia.org/superset/dashboard/7/?preselect_filters=%7B%7...
And (again, account required) EL data on Druid, this very same data we are talking about, page previews:
https://pivot.wikimedia.org/#tbayer_popups
I was going to make the point that #2 already has a processing pipeline established whereas #1 doesn't.
This is incorrect; we mark as "preview" data that we want to exclude from processing, see: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-... Naming is unfortunate, but previews are really "preloads", as in requests we make (and cache locally) that may or may not be shown to users.
But again, tracking of events is better done on an event based system and EL is such a system.
the beacon puts the record into the webrequest table and from there it would only take some trivial preprocessing
‘Trivial’ preprocessing that has to look through 150K requests per second! This is a lot of work!
tracking of events is better done on an event based system and EL is such a system.
I agree with this too. We really want to discourage people from trying to measure things by searching through the huge haystack of all webrequests. To measure something, you should emit an event if you can. If it were practical, I’d prefer that we did this for pageviews as well. Currently, we need a complicated definition of what a pageview is, which really only exists in the Java implementation in the Hadoop cluster. It’d be much clearer if app developers had a way to define for themselves what counts as a pageview, and emit that as an event.
This should be the approach that people take when they want to measure something new. Emit an event! This event will get its own Kafka topic (you can consume this to do whatever you like with it), and be refined into its own Hive table.
I don’t want to have to create that chart and export one dataset from pageviews and one dataset from eventlogging to do that.
If you also design your schema nicely (https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines), it will be easily importable into Druid and usable in Pivot and Superset, alongside pageviews. We’re working on getting nice schemas automatically imported into Druid (https://gerrit.wikimedia.org/r/#/c/386882/).
Hi all,
I just want to confirm that the proposed method using EventLogging will allow us to gather data in a similar fashion to the webrequest table. In particular, will we be able to break the data down by country, OS, browser, etc.? Our goal here is to be able to consider the new page interactions metric on the same level and with the same depth as pageviews.
Thanks!
- Olga
In particular, will we be able to break the data down by country, OS, browser, etc.?
OS and browser, yes. User agent parsing is done by the EventLogging processors.
Country, not quite as easily, as EventLogging does not include client IP addresses. We could consider putting this back in somehow, or I’ve also heard that there is a geocoded country cookie that Varnish will set that the browser could send back as part of the event. Is country enough geo detail?
(I'd defer to the Readers Web team with Tilman on whether country extracted from the cookie would be sufficient.)
Adding to this, one thing to consider is DNT - is there a way to invoke EL so that such traffic is appropriately imputed or something?
-Adam
Adding to this, one thing to consider is DNT - is there a way to invoke EL so that such traffic is appropriately imputed or something?
I am not sure what you are asking ...
On Thu, Jan 18, 2018 at 9:57 PM, Adam Baso abaso@wikimedia.org wrote:
Adding to this, one thing to consider is DNT - is there a way to invoke EL so that such traffic is appropriately imputed or something?
The EventLogging client respects DNT [0]. When the user enables DNT, mw.eventLog.logEvent is a NOP.
I don't see any mention of DNT in the Varnish VCLs around the /beacon endpoint or otherwise, but it may be handled elsewhere. While it's unlikely, there's nothing stopping a client from sending a well-formatted request to the /beacon/event endpoint directly [1], ignoring the user's choice.
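For reference, the DNT guard Sam describes behaves roughly like the sketch below; this is an approximation rather than the exact ext.eventLogging code, which may check additional vendor-prefixed variants.

// Approximation of the DNT check; the real EventLogging client may differ in detail.
function isDntEnabled() {
    var dnt = navigator.doNotTrack || window.doNotTrack || navigator.msDoNotTrack;
    return dnt === '1' || dnt === 'yes';
}

function maybeLogEvent( schema, event ) {
    if ( isDntEnabled() ) {
        return; // behave as a no-op, as mw.eventLog.logEvent does under DNT
    }
    mw.eventLog.logEvent( schema, event );
}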
-Sam
[0] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.e...
[1] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.e...
Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS library, would some sort of new method be needed so that these impressions aren't undercounted?
For virtual pageviews, people will probably be more interested in reports that belong to the first group (summing them up with normal pageviews, breaking them down along the dimensions that are relevant for web traffic, counting them for a given URL, etc).
Ah! OK, I get this use case now. I might not be able to comment about this much then. I think this totally changes the meaning of a pageview. Perhaps this is what you want? If so, this is outside the realm of my opinionatedness. :)
However, IF you do convince folks to change the meaning of ‘pageview’ to include ‘previews’, then we might be able to compromise. All I object to is more filtering of webrequests :) The rest of this email might be moot if we don’t change the ‘pageview definition’, but I’ll continue anyway…
The page previews data could come in as events. Augmenting the generated pageviews table from more incoming event sources sounds more flexible than doing more filtering logic in webrequests. I’d defer to the Analytics team members who would be implementing this though, I might be wrong.
In my ideal, pageviews and page_previews would both be separate event streams. These would be imported as-is to Hive tables, but also available in Kafka. You could join these together in a broader ‘content consumption’ dataset somehow, either in Hadoop with batch jobs, or more realtime with streaming jobs. (If this is done right, you can even use the same code for both cases.) If we had a good stream processing system here, I might suggest that we move pageview filtering to a more realtime setup and generate a derived pageview stream in Kafka. We’d then use that as the source of pageviews in Hadoop. Anyway, this is my ideal setup, but not what we have now! But we might one day (in the next FY???), and intaking events for page previews and other counters will help us migrate to this kind of architecture later.
Is that different from preprocessing them via EventLogging? Either way you take a HTTP request, and end up with a Hadoop record - is there something that makes that process a lot more costly for normal pageviews than EventLogging beacon hits?
From a hardware perspective, only in that the stream of events is much smaller, so there’s less wasted repeated I/O. From an engineering time perspective, if we use the webrequest tagging system to do this, I think we’re good, but only in the short term. In the long term, it hides the complexity involved in maintaining the logic of what a pageview, page preview, or any other ‘tagged’ webrequest is in complicated Java logic that is really only usable in Hadoop. I’m mainly objecting because we want to draw a line to stop doing this kind of thing. Doing this for page previews now might be ok if we really really really have to (although Nuria might not agree ;) ), but ultimately we need to push this kind of interaction logic out to feature developers who have more control over it.
The Analytics team wants to build infrastructure that make it easy for developers to measure their product usage, not implement the measuring logic ourselves.
Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS library, would some sort of new method be needed so that these impressions aren't undercounted?
If we had a lot of users with DNT, maybe; from our tests when we enabled that on EL, this is not the case. Your team has already run experiments on this functionality and they can speak as to the projection of numbers.
You could join these together in a broader ‘content consumption’ dataset somehow, either in Hadoop with batch jobs, or more realtime with streaming jobs.
Hm, idea… which I think has been mentioned before: could we leave pageviews as is, but make a new dataset that counts both pageviews and page previews? Maybe this is ‘content_views’? We could explicitly state that the definition of content_views is supposed to change with time, and could possibly incorporate other future types of content views too. Eh?
Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS library, would some sort of new method be needed so that these impressions aren't undercounted?
If we had a lot of users with DNT, maybe; from our tests when we enabled that on EL, this is not the case.
Thanks, good to know - is there a report around that? I'm wondering how "missing requests" ought to be expressed with some margin of error.
Thanks, good to know - is there a report around that? I'm wondering how "missing requests" ought to be expressed with some margin of error.
I think the ones who can quantify this best are on your team. If anything, from what I remember of the popups experiments, the inflow of events was higher than the expected calculations. Overall usage of DNT for FF users was about ~10% last time we looked at it; overall usage across our user base is quite a bit smaller, I bet.
https://blog.mozilla.org/netpolicy/2013/05/03/mozillas-new-do-not-track-dash...
Thanks.
Hullo all,

It seems like we've arrived at an implementation for the client-side (JS) part of this problem: use EventLogging to track a page interaction from within the Page Previews code. This'll give us the flexibility to take advantage of a stream processing solution if/when it becomes available, to push the definition of a "Page Previews page interaction" to the client, and to rely on any events that we log in the immediate future ending up in tables that we're already familiar with.

In principle, I agree with Andrew's argument that adding additional filtering logic to the webrequest refinement process will make it harder to change existing definitions of views or add others in future. In practice though, we'll need to:

- Ensure that the server-side EventLogging component records metadata consistent with our existing content consumption measurement, concretely: the fields available in the https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly table. In particular, that it either doesn't discard the client IP or utilizes the GeoIP cookie sent by the client for this schema.
- Aggregate the resulting table so that it can be combined with the pageviews table to generate reports.
- Ensure that the events aren't recorded in MySQL.

Using the GeoIP cookie will require reconfiguring the EventLogging varnishkafka instance [0], and raises questions about the compatibility with the corresponding field in the pageviews data. Retaining the client IP will require a similar change but will also require that we share the geocoding code with whatever process we use to refine the data that we're capturing via EventLogging. Is the geocoding code that we use on webrequest_raw available as a Hive UDF or in PySpark?

Aggregating the EventLogging data in the same way that we aggregate webrequest data into pageviews data will require either: replicating the process that does this and keeping the two processes in sync; or abstracting away the source table from the aggregation process so that it can work on both tables. We'll have to maintain the chosen approach until it's superseded by a stream processing solution, the timeline of which is currently measured in years.

My next steps are making sure that Audiences Product's requirements are all visible and to work with Tilman Bayer to create a schema that's suitable for our purposes but hopefully useful to others. Nuria has also offered to give a technical overview of EventLogging, which I think would be a great resource for everyone, so I'll look into setting up a meeting. I'd appreciate it if someone could estimate how much work it will be to implement GeoIP information and the other fields from Pageview hourly for EventLogging events on a per-schema basis.

Thanks,
-Sam

[0] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/manifests/cache/kafka/eventlogging.pp;52da8d06c760cd4e31b068d1a0392e3b3889033c$37
CoOOOl :)
Using the GeoIP cookie will require reconfiguring the EventLogging
varnishkafka instance [0]
I’m not familiar with this cookie, but, if we used it, I thought it would be sent back by the client in the event. E.g. event.country = response.headers.country; EventLogging.emit(event);
That way, there’s no additional special logic needed on the server side to geocode or populate the country in the event.
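Something like this is what I'm picturing (untested sketch; I'm assuming the cookie is readable client-side under the name 'GeoIP', and the schema/field names are made up):

    // Read the country cookie set at the edge; empty prefix because it isn't a
    // MediaWiki cookie. Assumed value shape: 'US:WA:Seattle:...'.
    var geo = mw.cookie.get( 'GeoIP', '' ) || '';
    mw.eventLog.logEvent( 'VirtualPageView', { // hypothetical schema name
        pageTitle: mw.config.get( 'wgPageName' ),
        country: geo.split( ':' )[ 0 ] || 'Unknown' // just the country part
    } );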
However, if y’all can’t or don’t want to use the country cookie, then yaaa, we gotta figure out what to do about IPs and geocoding in EventLogging. There are a few options here, but none of them are great. The options basically are variations on ‘treat this event schema as special and make special conditionals in EventLogging processor code’, or, 'include IP and/or geocode all events in all schemas'. We’re not sure which we want to do yet, but we did mention this at our offsite today. I think we’ll figure this out and make it happen in the next week or two. Whatever the implementation ends up being, we’ll get geocoded data into this dataset.
Is the geocoding code that we use on webrequest_raw available as a Hive UDF or in PySpark?

The IP is geocoded from wmf_raw.webrequest to wmf.webrequest using a Hive UDF (https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-hive/src/main/java/org/wikimedia/analytics/refinery/hive/GetGeoDataUDF.java), which ultimately just calls this getGeocodedData function (https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Geocode.java#L138), which itself is just a wrapper around the Maxmind API. We may end up doing geocoding in the EventLogging server codebase (again, really not sure about this yet…), but if we do, it will use the same Maxmind databases.
Aggregating the EventLogging data in the same way that we aggregate
webrequest data into pageviews data will require either: replicating the process that does this and keeping the two processes in sync; or abstracting away the source table from the aggregation process so that it can work on both tables
I’m not totally sure if this works for you all, but I had pictured generating aggregates from the page preview events, and then joining the page preview aggregates with the pageview aggregates into a new table with an extra dimension specifying which type of content view was made.
I’d appreciate it if someone could estimate how much work it will be to
implement GeoIP information and the other fields from Pageview hourly for EventLogging events
Ya we gotta figure this out still, but actual implementation shouldn’t be difficult, however we decide to do it.
I’m not totally sure if this works for you all, but I had pictured
generating aggregates from the page preview events, and then joining the page preview aggregates with the pageview aggregates into a new table with an extra dimension specifying which type of content view was made.
In my opinion the aggregated data should stay in two different tables. I can see a future where the preview data is of different types (it might include rich media that was/was not played, there are simple popups and "richer" ones, whatever) and the dimensions in which you represent this consumption are not going to match pageview_hourly, which again only represents full page loads well.
On Tue, Jan 30, 2018 at 8:02 AM, Andrew Otto otto@wikimedia.org wrote:
Using the GeoIP cookie will require reconfiguring the EventLogging
varnishkafka instance [0]
I’m not familiar with this cookie, but, if we used it, I thought it would be sent back by the client in the event. E.g. event.country = response.headers.country; EventLogging.emit(event);
That way, there’s no additional special logic needed on the server side to geocode or populate the country in the event.
Hah! I didn't think about accessing the GeoIP cookie on the client. As you say, the implementation is quite easy.
My only concern with this approach is the duplication of the value between the cookie, which is sent in every HTTP request to the /beacon/event endpoint, and the event itself. This duplication seems reasonable when balanced against capturing either: the client IP and then doing similar geocoding further along in the pipeline; or the cookie for all requests to that endpoint and then discarding it further along in the pipeline. It also reflects a seemingly core principle of the EventLogging system: that it doesn't capture potential PII by default.
-Sam
Wow Sam, yeah, if this cookie works for you, it will make many things much easier for us. Check it out and let us know. If it doesn’t work for some reason, we can figure out the backend geocoding part.
Wow Sam, yeah, if this cookie works for you, it will make many things much easier for us

This is how it is done on performance schemas for Navigation Timing data per country, so there is precedent: https://github.com/wikimedia/mediawiki-extensions-NavigationTiming/blob/mast...
In this case, because a preview request must happen after a full page download, the cookie will always be available. Now, the cookie mappings are of the form US:WA:Seattle, so they would need further processing to be akin to the current pageviews split.
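To illustrate the kind of further processing I mean (field order assumed from the example above, not checked against the actual cookie format):

    // Split the colon-separated cookie value into fields comparable to the
    // country/subdivision/city split we use for pageviews.
    function parseGeoCookie( value ) {
        var parts = ( value || '' ).split( ':' );
        return {
            country: parts[ 0 ] || 'Unknown',
            subdivision: parts[ 1 ] || 'Unknown',
            city: parts[ 2 ] || 'Unknown'
        };
    }

    parseGeoCookie( 'US:WA:Seattle' );
    // => { country: 'US', subdivision: 'WA', city: 'Seattle' }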
Just a quick update: I've captured details from this discussion and the background in https://phabricator.wikimedia.org/T184793. I'd sure appreciate your feedback.
-Sam
Thanks everyone! Separate from Sam's mapping out the frontend instrumentation work at https://phabricator.wikimedia.org/T184793 , I have created a task for the backend work at https://phabricator.wikimedia.org/T186728 based on this thread.
Regarding the last few posts about the geolocation information, from the data analysis perspective, there is indeed another, more serious concern about using the GeoIP cookie: It will create significant discrepancies with the existing geolocation data we record for pageviews, where we have chosen to derive this information from the IP instead. (Remember the overarching goal here of measuring page previews the same way we measure pageviews currently; the basic principle is that if a reader visits a page and then uses the page preview feature on that page to read preview cards, all the metadata that is recorded should have identical values for both the preview and the pageview.) Therefore, we should go with the kind of solution Andrew outlined above (adapting/reusing GetGeoDataUDF or such).
It will create significant discrepancies with the existing geolocation data we record for pageviews

If you only need country (or whatever is in the cookie), then likely whatever the output dataset is would only include country when selecting from pageviews. If you need more than country (it sounded like you didn't), then we can get into doing the IP geocoding in EventLogging, but there are a few technical complications here, and we'd prefer not to do this if we don't have to.
On Wed, Feb 7, 2018 at 9:19 AM, Andrew Otto otto@wikimedia.org wrote:
It will create significant discrepancies with the existing geolocation data we record for pageviews
If you only need country (or whatever is in the cookie), then likely whatever the output dataset is would only include country when selecting from pageviews. If you need more than country (it sounded like you didn't), then we can get into doing the IP geocoding in EventLogging, but there are a few technical complications here, and we'd prefer not to do this if we don't have to.
As mentioned repeatedly in this thread (see e.g. Sam's Jan 29 email), the goal is to record metadata consistent with our existing content consumption measurement, concretely: the fields available in the pageview_hourly table. See https://phabricator.wikimedia.org/T186728 for details (also regarding other fields that are not in EL by default but are likewise generated in a standard fashion for webrequest/pageview data).
I appreciate it will need a bit of engineering work to implement your proposal of reusing the existing UDF that underlies the pageview data for the new preview data. But it will serve to avoid a lot of data limitations and headaches for years to come. To highlight just one aspect: If we relied on the cookie, the data would be inconsistent from the start because not all clients accept cookies. When we want to know (say) the ratio of previews to pageviews in a particular country, we don't want to have to embark on a research project estimating the number of cookie-less pageviews in that country. And so on.
Gonna paste your reply on the ticket https://phabricator.wikimedia.org/T184793 and respond there.
Regarding the last few posts about the geolocation information, from the
data analysis perspective, there is indeed another, more serious concern about using the GeoIP cookie: It will create significant discrepancies with the existing geolocation data we record for pageviews, where we have chosen to derive this information from the IP instead
How did you come to the conclusion that the data will differ? The GeoIP cookie is inferred from your IP just the same, right? https://github.com/wikimedia/puppet/blob/production/modules/varnish/template...
Can we keep further discussion on the phablet thread?
Wow auto complete, you know what I mean. :)
On Wed, Feb 7, 2018 at 4:45 PM, Andrew Otto otto@wikimedia.org wrote:
Can we keep further discussion on the phablet thread?
In particular, will we be able to sort by country, OS, Browser, etc?
Yes, as Andrew said, EL would need to parse IP addresses and publish that data.
On Thu, Jan 18, 2018 at 12:13 PM, Andrew Otto otto@wikimedia.org wrote:
In particular, will we be able to sort by country, OS, Browser, etc?
OS, Browser, yes. User Agent parsing is done by the EventLogging processors.
Country not quite as easily, as EventLogging does not include client IP addresses. We could consider putting this back in somehow, or, I’ve also heard that there is a geocoded country cookie that varnish will set that the browser could send back as part of the event. Is country enough geo detail?
On Thu, Jan 18, 2018 at 2:30 PM, Olga Vasileva ovasileva@wikimedia.org wrote:
Hi all,
I just want to confirm that the proposed method using Eventlogging will allow us to gather data in a similar fashion to the web request table. In particular, will we be able to sort by country, OS, Browser, etc? Our goal here is to be able to consider the new page interactions metric on the same level and with the same depth as pageviews.
Thanks!
- Olga
On Thu, Jan 18, 2018 at 12:46 PM Andrew Otto otto@wikimedia.org wrote:
the beacon puts the record into the webrequest table and from there it would only take some trivial preprocessing

'Trivial' preprocessing that has to look through 150K requests per second! This is a lot of work!
tracking of events is better done on an event based system and EL is such a system.

I agree with this too. We really want to discourage people from trying to measure things by searching through the huge haystack of all webrequests. To measure something, you should emit an event if you can. If it were practical, I'd prefer that we did this for pageviews as well. Currently, we need a complicated definition of what a pageview is, which really only exists in the Java implementation in the Hadoop cluster. It'd be much clearer if app developers had a way to define themselves what counts as a pageview, and emit that as an event.
This should be the approach that people take when they want to measure something new. Emit an event! This event will get its own Kafka topic (you can consume this to do whatever you like with it), and be refined into its own Hive table.
I don't want to have to create that chart and export one dataset from pageviews and one dataset from eventlogging to do that.

If you also design your schema nicely (https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines), it will be easily importable into Druid and usable in Pivot and Superset, alongside pageviews. We're working on getting nice schemas automatically imported into Druid: https://gerrit.wikimedia.org/r/#/c/386882/
On Thu, Jan 18, 2018 at 11:16 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Gergo,
while EventLogging data gets stored in a different, unrelated way
Not really, this has changed quite a bit over the last two quarters. EventLogging data now gets preprocessed and refined similarly to how webrequest data is preprocessed and refined. You can have a dashboard on top of some EventLogging schemas on superset in the same way you have a dashboard that displays pageview data on superset.
See dashboards on superset (user required).
https://superset.wikimedia.org/superset/dashboard/7/?preselect_filters=%7B%7D
And (again, user required) EL data on druid, this very same data we are talking about, page previews:
https://pivot.wikimedia.org/#tbayer_popups
I was going to make the point that #2 already has a processing pipeline established whereas #1 doesn't.

This is incorrect, we mark as "preview" data that we want to exclude from processing, see: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L144 Naming is unfortunate but previews are really "preloads", as in requests we make (and cache locally) that may or may not be shown to users.
But again, tracking of events is better done on an event based system and EL is such a system.
On Thu, Jan 18, 2018 at 10:45 AM, Andrew Otto otto@wikimedia.org wrote:
the beacon puts the record into the webrequest table and from there it would only take some trivial preprocessing

'Trivial' preprocessing that has to look through 150K requests per second! This is a lot of work!
I think Gergo may have been referring to the human work involved in implementing that preprocessing step. I assume it could be quite analogous to the one your team has implemented for pageviews: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/pageview/hourly/pageview_hourly.hql
Are you saying that the server load generated by such an additional aggregation query would be a blocker? If yes, how about we combine the two (for pageviews and previews) into one?
Are you saying that the server load generated by such an additional
aggregation query would be a blocker? If yes, how about we combine the two (for pageviews and previews) into one?
Sorry, no, it isn't a blocker. The tagging logic that Nuria and others have been working on for a while now makes this a little easier, since the webrequests only need to be read once to add all tags. It is separate from pageviews (for now), but we might use tagging for pageviews eventually too.
I assume it could be quite analogous to the one your team has implemented for pageviews

If we did it like the linked Hive query, it would be quite a lot. We don't want to read every webrequest from disk for every aggregate dataset. Tagging helps, since we define the set of tags and filters once, and the job that adds tags reads all webrequests once and adds all tags.
But anyway, yes, it can be done.
I'm mostly objecting and recommending EventLogging because we really shouldn't be searching webrequest to measure interactions over and over again. It's fragile and monolithic and not very portable. Events are better :)
On Thu, Jan 18, 2018 at 8:16 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Gergo,
while EventLogging data gets stored in a different, unrelated way
Not really, this has changed quite a bit over the last two quarters. EventLogging data now gets preprocessed and refined similarly to how webrequest data is preprocessed and refined. You can have a dashboard on top of some EventLogging schemas on superset in the same way you have a dashboard that displays pageview data on superset.
I don't see how this addresses Gergo's larger point about the difference between consistently tallying content consumption (pageviews, previews, mediaviewer image views) and analyzing UI interactions (which is the main use case that EventLogging has been developed and used for). There are really quite a few differences between these two. For example, UI instrumentations on the web are almost always sampled, because that yields enough data to answer UI questions - but on the other hand tend to record much more detail about the individual interaction. In contrast, we register all pageviews unsampled, but don't keep a permanent record of every single one of them with precise timestamps - rather, we have aggregated tables (pageview_hourly in particular). Our EventLogging backend is not tailored to that.
See dashboards on superset (user required).
https://superset.wikimedia.org/superset/dashboard/7/?preselect_filters=%7B%7D
And (again, user required) EL data on druid, this very same data we are talking about, page previews:
That's actually not the "very same data we are talking about". You can rest assured that the web team (and Sam in particular) has already been aware of the existence of the Popups instrumentation for page previews. The team spent considerable effort building it in order to understand how users interact with the feature's UI. Now comes the separate effort of systematically tallying content consumption from this new channel. Superset and Pivot are great, but are nowhere near providing all the ways that WMF analysts and community members currently have to study pageview data. Storing data about seen previews in the same way as we do for pageviews, for example in the pageview_hourly (suitably tagged, perhaps giving that table a more general name) would facilitate that a lot, by allowing us to largely reuse the work that during the past few years went into getting pageview aggregation right.
I was going to make the point that #2 already has a processing pipeline established whereas #1 doesn't.

This is incorrect, we mark as "preview" data that we want to exclude from processing, see: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L144 Naming is unfortunate but previews are really "preloads" as in requests we make (and cache locally) and maybe shown to users or not.
But again, tracking of events is better done on an event based system and EL is such a system.
Again, tracking of individual events is not the ultimate goal here.
For example, UI instrumentations on the web are almost always sampled,
because that yields enough data to answer UI questions - but on the other hand tend to record much more detail about the individual interaction. In contrast, we register all pageviews unsampled, but don't keep a permanent record of every single one of them with precise timestamps - rather, we have aggregated tables (pageview_hourly in particular). Our EventLogging backend is not tailored to that.
When you say “Our EventLogging backend here”, what are you referring to? If MySQL, then for sure. :)
Storing data about seen previews in the same way as we do for pageviews,
for example in the pageview_hourly (suitably tagged, perhaps giving that table a more general name) would facilitate that a lot, by allowing us to largely reuse the work that during the past few years went into getting pageview aggregation right.
I’m not totally opposed to doing it this way, but at some point we need to realize that this isn’t a scalable (human and CPU resource wise) way to measure user feature interaction.
I don’t think a pageview is inherently different than any other kind of impression, it’s just that we didn’t have the ability in the past (or now?) for pageviews to be collected and measured like they should. If we were designing an interaction measurement system now, it wouldn’t look exactly like EventLogging, but it would look like something close to it. And if it did everything I’d want it to, we would use it to measure pageviews and everything else you’ve mentioned.
Making events be the source of truth is more accurate than implementing custom batch logic in Hadoop to comb through webrequests and filter out what you are looking for. It pushes control of the definition of what counts as a ‘pageview’ or ‘page preview’ to the folks who are developing the app/website/feature. If we use webrequests+Hadoop tagging to count these, any time in the future there is a change to the URLs that page previews load (or the beacon URLs they hit), we’d have to make a patch to the tagging logic and release and deploy a new refinery version to account for the change. Any time a new feature is added for which someone wants interactions counted, we have to do the same.
Heck, if you use events, you could very easily consume and/or aggregate or emit them to anywhere you wanted. Your own datastore, a grafana dashboard, a monitoring system, etc. etc. :) It also will help us to standardize this type of thing, so that in the future creation of new dashboards can be more automated.
I don't see how this addresses Gergo's larger point about the difference
between consistently tallying content consumption (pageviews, previews, mediaviewer image views) and analyzing UI interactions (which is the main use case that EventLogging has been developed and used for).
EventLogging use cases are events. As we move to a thicker client -more javascript heavy- you will need to measure events for -nearly- everything; whether those are to be considered "content consumption" or "ui interaction" is not that relevant. Example: video plays are content consumption and are also "ui interactions".
We are the only major website that does not have a thick client, so this notion of joining UI interactions and consumption is new to us, but really it is not that new at all.
On Thu, Jan 18, 2018 at 3:17 PM, Tilman Bayer tbayer@wikimedia.org wrote:
On Thu, Jan 18, 2018 at 8:16 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Gergo,
while EventLogging data gets stored in a different, unrelated way
Not really, This has changed quite a bit as of the last two quarters. Eventlogging data as of recent gets preprocessed and refined similar to how webrequest data is preprocessed and refined. You can have a dashboard on top of some eventlogging schemas on superset in the same way you have a dashboard that displays pageview data on superset.
I don't see how this addresses Gergo's larger point about the difference between consistently tallying content consumption (pageviews, previews, mediaviewer image views) and analyzing UI interactions (which is the main use case that EventLogging has been developed and used for). There are really quite a few differences between these two. For example, UI instrumentations on the web are almost always sampled, because that yields enough data to answer UI questions - but on the other hand tend to record much more detail about the individual interaction. In contrast, we register all pageviews unsampled, but don't keep a permanent record of every single one of them with precise timestamps - rather, we have aggregated tables (pageview_hourly in particular). Our EventLogging backend is not tailored to that.
See dashboards on superset (user required).
https://superset.wikimedia.org/superset/dashboard/7/?presele ct_filters=%7B%7D
And (again, user required) EL data on druid, this very same data we are talking about, page previews:
That's actually not the "very same data we are talking about". You can rest assured that the web team (and Sam in particular) has already been aware of the existence of the Popups instrumentation for page previews. The team spent considerable effort building it in order to understand how users interact with the feature's UI. Now comes the separate effort of systematically tallying content consumption from this new channel. Superset and Pivot are great, but are nowhere near providing all the ways that WMF analysts and community members currently have to study pageview data. Storing data about seen previews in the same way as we do for pageviews, for example in the pageview_hourly (suitably tagged, perhaps giving that table a more general name) would facilitate that a lot, by allowing us to largely reuse the work that during the past few years went into getting pageview aggregation right.
I was going to make the point that #2 already has a processing pipeline established whereas #1 doesn't.
This is incorrect: we mark as "preview" data that we want to exclude from processing, see: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L144
The naming is unfortunate, but those "previews" are really "preloads", as in requests we make (and cache locally) that may or may not be shown to users.
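The linked refinery code is Java, but the check it describes amounts to roughly the following sketch in Python. The `x_analytics` field shape and the `preview=1` tag value are assumptions based on this thread and the X-Analytics docs, not a copy of the real implementation:

```python
def parse_x_analytics(header):
    """Parse an X-Analytics header of the form 'key1=val1;key2=val2' into a dict."""
    return dict(
        pair.split("=", 1)
        for pair in header.split(";")
        if "=" in pair
    )

def counts_as_pageview(request):
    """Sketch only: requests tagged as previews/preloads are excluded up front."""
    x_analytics = parse_x_analytics(request.get("x_analytics", ""))
    if x_analytics.get("preview") == "1":
        return False
    # ... the real PageviewDefinition applies many more checks here
    # (host, path, MIME type, HTTP status, user agent, etc.).
    return True

# Example:
# counts_as_pageview({"x_analytics": "preview=1;https=1"})  # -> False
```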
But again, tracking of events is better done on an event based system and EL is such a system.
Again, tracking of individual events is not the ultimate goal here.
-- Tilman Bayer Senior Analyst Wikimedia Foundation IRC (Freenode): HaeB
On Thu, Jan 18, 2018 at 3:56 PM, Nuria Ruiz nuria@wikimedia.org wrote:
Event logging use cases are events; as we move to a thicker, more JavaScript-heavy client, you will need to measure events for nearly everything, and whether those are considered "content consumption" or "UI interaction" is not that relevant. Example: video plays are content consumption and are also "UI interactions".
That could be an argument for not separating pageviews from events (in which case the question of whether virtual pageviews should be more like pageviews or more like events would be moot), but given that those *are* separated I don't see how it applies. In the current analytics setup, and given what kinds of frontends are currently supported, there are types of report generation that are easier to perform on pageviews and not so easy on events, and other types of report generation that are easier to do on events. For virtual pageviews, people will probably be more interested in reports that belong to the first group (summing them up with normal pageviews, breaking them down along the dimensions that are relevant for web traffic, counting them for a given URL, etc.).
On Thu, Jan 18, 2018 at 10:45 AM, Andrew Otto otto@wikimedia.org wrote:
the beacon puts the record into the webrequest table and from there it would only take some trivial preprocessing
'Trivial' preprocessing that has to look through 150K requests per second! This is a lot of work!
Is that different from preprocessing them via EventLogging? Either way you take an HTTP request and end up with a Hadoop record - is there something that makes that process a lot more costly for normal pageviews than for EventLogging beacon hits?
Anyway what I meant by trivial preprocessing is that you take something like http://bits.wikimedia.org/beacon/page-preview?duration=123&uri=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FFoo, convert it into https://en.wikipedia.org/wiki/Foo, tack the duration and the type ('page-preview') into some extra fields, add those extra fields to the dimensions along which pageviews can be inspected, and you have integrated virtual views into your analytics APIs / UIs, almost for free. The alternative would be that every analytics customer who wants to deal with content consumption and does not want to automatically filter out content consumption happening via thick clients would have to update their interfaces and do some kind of union query to merge the data that's now distributed between the webrequest table and one or more EventLogging tables; surely that's less expedient?
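To make that concrete, the kind of transformation being described might look roughly like the sketch below. The `virtual_type` field name is only the one floated in this thread, and the /beacon/page-preview endpoint is only the proposed URL; neither exists today:

```python
from urllib.parse import urlsplit, parse_qs

def beacon_to_virtual_view(beacon_url):
    """Turn a /beacon/<type>?duration=X&uri=Y hit into a virtual-view record."""
    parts = urlsplit(beacon_url)
    if not parts.path.startswith("/beacon/"):
        return None
    params = parse_qs(parts.query)
    return {
        # the previewed page's URL, counted like a pageview of that page
        "uri": params.get("uri", [""])[0],               # parse_qs already percent-decodes
        # extra dimensions that distinguish virtual views from navigations
        "virtual_type": parts.path[len("/beacon/"):],    # e.g. 'page-preview'
        "duration": int(params.get("duration", ["0"])[0]),
    }

# beacon_to_virtual_view(
#     "http://bits.wikimedia.org/beacon/page-preview?duration=123"
#     "&uri=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FFoo"
# )
# -> {'uri': 'https://en.wikipedia.org/wiki/Foo',
#     'virtual_type': 'page-preview', 'duration': 123}
```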
If we use webrequests+Hadoop tagging to count these, any time in the future
there is a change to the URLs that page previews load (or the beacon URLs they hit), we’d have to make a patch to the tagging logic and release and deploy a new refinery version to account for the change. Any time a new feature is added for which someone wants interactions counted, we have to do the same.
There doesn't seem to be much reason for the beacon URL to ever change. As for new beacon endpoints (new virtual view types), why can't that just be a whitelist that's offloaded to configuration?
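Purely as an illustration of "offloaded to configuration" (the mapping and its entries below are made up; only /beacon/media appears earlier in this thread), the endpoint-to-type mapping could be plain data rather than code:

```python
# Hypothetical configuration: beacon path -> virtual view type.
# Adding a new virtual view type would mean adding an entry here,
# not patching the tagging logic and deploying a new refinery release.
VIRTUAL_VIEW_ENDPOINTS = {
    "/beacon/media": "media-view",
    "/beacon/page-preview": "page-preview",
}

def virtual_type_for(path):
    """Return the virtual view type for a beacon path, or None if not whitelisted."""
    return VIRTUAL_VIEW_ENDPOINTS.get(path)
```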
So maybe it's worth considering which approach takes us closer to that?
AIUI the beacon puts the record into the webrequest table and from there it would only take some trivial preprocessing to replace the beacon URL with the virtual URL and add the beacon type as a "virtual_type" field or something, making it very easy to expose it everywhere where views are tracked, while EventLogging data gets stored in a different, unrelated way.
Anything that involves combing 1 terabyte of data a day and 150,000 requests per second at peak cannot be considered "simple" or "trivial". Rather than looking for a needle in the haystack, let's please rely on the client to send you preselected data (events). That data can be aggregated later in different ways, and the fact that the data comes from EventLogging does not dictate how aggregation needs to happen.
On Wed, Jan 17, 2018 at 6:09 PM, Gergo Tisza gtisza@wikimedia.org wrote:
On Wed, Jan 17, 2018 at 10:54 AM, Nuria Ruiz nuria@wikimedia.org wrote:
Recording "preview_events" is really no different that recording any other kind of UI event, difference is going to come from scale if anything, as they are probably tens of thousands of those per second (I think your team already estimated volume, if so please send those estimates along)
Conceptually I think a virtual pageview is a different thing from a UI event (which is how e.g. Google Analytics handles it: there is a method to send an event for the current page and a different method to send a virtual pageview for a different page), and the ideal way it is exposed in an analytics system should be very different. (I would want to see virtual pageviews together with normal pageviews, with some filtering option. If I deploy code that shows previews and converts users from making real pageviews to making virtual pageviews, I want to see how the total pageviews changed in the normal pageview stats; I don't want to have to create that chart and export one dataset from pageviews and one dataset from EventLogging to do that. As a user, I want to see in the fileview API how many people looked at the photo I uploaded; I don't particularly care whether they used MediaViewer or not. Etc.)
So maybe it's worth considering which approach takes us closer to that? AIUI the beacon puts the record into the webrequest table and from there it would only take some trivial preprocessing to replace the beacon URL with the virtual URL and add the beacon type as a "virtual_type" field or something, making it very easy to expose it everywhere where views are tracked, while EventLogging data gets stored in a different, unrelated way.
On Wed, Jan 17, 2018 at 10:54 AM, Nuria Ruiz nuria@wikimedia.org wrote:
(Moving ops list to bcc)
Are there other ways of recording this information? We're fairly confident that #1 seems like the best choice here but it's referred to as the "virtual file view hack". Is this really the case?
Yes, there are: please use EventLogging.
Recording "preview_events" is really no different that recording any other kind of UI event, difference is going to come from scale if anything, as they are probably tens of thousands of those per second (I think your team already estimated volume, if so please send those estimates along)
Actually Sam's email already included such a volume estimate (see [3] there for more detail, or https://phabricator.wikimedia.org/T182314#3901330). Rather than "tens of thousands", the current estimate is 700-800 per second.
We discourage you from sending events directly to the beacon. Rather, use the EL client to send a page-preview event defined in a given schema. This is similar to how we will be measuring banner impressions for fundraising banners in the future.
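For illustration, such a page-preview event sent through EventLogging might look roughly like this. This is a sketch only; the schema name and every field in it are hypothetical, not an existing schema:

```python
import json

# Hypothetical EventLogging payload for one preview that stayed open
# past the threshold. Neither the schema name nor the fields are real.
event = {
    "schema": "PagePreviewsVirtualView",   # made-up schema name
    "event": {
        "source_title": "Hovercraft",      # page the reader was hovering on
        "target_title": "Foo",             # page that was previewed
        "duration_ms": 1234,               # how long the preview stayed open
    },
}

print(json.dumps(event, indent=2))
```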
Thanks,
Nuria
On Wed, Jan 17, 2018 at 1:51 AM, Sam Smith samsmith@wikimedia.org wrote:
Hullo,
Page Previews is now fully deployed to all but 2 of the Wikipedias. In deploying it, we've created a new way to interact with pages without navigating to them. This impacts the overall and per-page pageviews metrics that are used in myriad reports, e.g. to editors about the readership of their articles and in monthly reports to the board. Consequently, we need to be able to report a user reading the preview of a page just like we do them navigating to it.
Readers Web are planning to instrument Page Previews such that when a preview is available and open for longer than X ms, a "page interaction" is recorded. We're aware of a couple of mechanisms for recording something like this from the client:
All files viewed with the media viewer are recorded by the client requesting the /beacon/media?duration=X&uri=Y URL at some point [0] – as Nuria points out in that thread, requests to /beacon/... are already filtered and a canned response is sent immediately by Varnish [1]. Requesting a URL with the X-Analytics header [2] set to "preview". In this context, we'd make a HEAD request to the URL of the page with the header set.
IMO #1 is preferable from the operations and performance perspectives as the response is always served from the edge and includes very few headers, whereas the request in #2 may be served by the application servers if the user is logged in (or in the mobile site's beta cohort). However, the requests in #2 are already
We're currently considering recording page interactions when previews are open for longer than 1000 ms. We estimate that this would increase overall web requests by 0.3% [3].
Are there other ways of recording this information? We're fairly confident that #1 seems like the best choice here but it's referred to as the "virtual file view hack". Is this really the case? Moreover, should we request a distinct URL, e.g. /beacon/preview?duration=X&uri=Y, or should we consolidate the URLs as both represent the same thing essentially?
Thanks,
-Sam
Timezone: GMT IRC (Freenode): phuedx
[0] https://lists.wikimedia.org/pipermail/analytics/2015-March/003633.html [1] https://phabricator.wikimedia.org/source/operations-puppet/browse/production... [2] https://wikitech.wikimedia.org/wiki/X-Analytics [3] https://phabricator.wikimedia.org/T184793#3901365