On Thu, Jan 18, 2018 at 3:56 PM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
> Event logging use cases are events; as we move to a thicker client
> (more JavaScript-heavy), you will need to measure events for nearly
> everything; whether those are to be considered "content consumption" or
> "UI interaction" is not that relevant. Example: video plays are content
> consumption and are also "UI interactions".
That could be an argument for not separating pageviews from events (in which
case the question of whether virtual pageviews should be more like pageviews
or more like events would be moot), but given that those *are* separated, I
don't see how it applies. In the current analytics setup, and given what kinds
of frontends are currently supported, there are types of report generation
that are easier to perform on pageviews and not so easy on events, and other
types that are easier to do on events. For virtual pageviews, people will
probably be more interested in reports that belong to the first group: summing
them up with normal pageviews, breaking them down along the dimensions that
are relevant for web traffic, counting them for a given URL, etc.
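To make that concrete, here is a toy Python sketch (the records and the
'view_type' dimension are my own illustration, not the real webrequest schema)
of how virtual views would slot into pageview-style aggregations once they
share the pageview dimensions:

    from collections import Counter

    # Hypothetical pageview records; 'view_type' is an assumed extra
    # dimension, not part of the actual schema.
    pageviews = [
        {"uri": "https://en.wikipedia.org/wiki/Foo", "view_type": "page"},
        {"uri": "https://en.wikipedia.org/wiki/Foo", "view_type": "page-preview"},
        {"uri": "https://en.wikipedia.org/wiki/Bar", "view_type": "page"},
    ]

    # Summing virtual views together with normal ones for a given URL:
    totals = Counter(pv["uri"] for pv in pageviews)

    # Breaking them down by type is just one more group-by key:
    by_type = Counter((pv["uri"], pv["view_type"]) for pv in pageviews)

    print(totals["https://en.wikipedia.org/wiki/Foo"])  # 2 (one view, one preview)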
On Thu, Jan 18, 2018 at 10:45 AM, Andrew Otto <otto(a)wikimedia.org> wrote:
>> the beacon puts the record into the webrequest table and from there it
>> would only take some trivial preprocessing
>
> ‘Trivial’ preprocessing that has to look through 150K requests per second!
> This is a lot of work!
Is that different from preprocessing them via EventLogging? Either way you
take an HTTP request and end up with a Hadoop record; is there something that
makes that process a lot more costly for normal pageviews than for
EventLogging beacon hits?
Anyway, what I meant by trivial preprocessing is that you take something like
http://bits.wikimedia.org/beacon/page-preview?duration=123&uri=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FFoo,
convert it into https://en.wikipedia.org/wiki/Foo, tack the duration and the
type ('page-preview') into some extra fields, add those extra fields to the
dimensions along which pageviews can be inspected, and you have integrated
virtual views into your analytics APIs / UIs almost for free (see the sketch
below).

The alternative would be that every analytics customer who wants to deal with
content consumption, and does not want to automatically filter out the content
consumption happening via thick clients, would have to update their interfaces
and do some kind of union query to merge the data that's now distributed
between the webrequest table and one or more EventLogging tables; surely
that's less expedient?
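In code, that preprocessing would amount to something like this minimal Python
sketch (the function name and the output field names are my own illustration,
not the actual refinery schema):

    from urllib.parse import urlparse, parse_qs

    def beacon_hit_to_virtual_pageview(beacon_url):
        """Turn one /beacon/page-preview hit into a pageview-style record.

        Output field names are illustrative, not the real schema.
        """
        parsed = urlparse(beacon_url)
        params = parse_qs(parsed.query)  # also percent-decodes the values
        return {
            # The decoded 'uri' parameter joins the existing pageview dimensions:
            "uri": params["uri"][0],
            # The endpoint name becomes the view type, e.g. 'page-preview':
            "view_type": parsed.path.rsplit("/", 1)[-1],
            # An extra field that old reports can simply ignore:
            "duration_ms": int(params["duration"][0]),
        }

    record = beacon_hit_to_virtual_pageview(
        "http://bits.wikimedia.org/beacon/page-preview"
        "?duration=123&uri=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FFoo"
    )
    # {'uri': 'https://en.wikipedia.org/wiki/Foo',
    #  'view_type': 'page-preview', 'duration_ms': 123}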
> If we use webrequests+Hadoop tagging to count these, any time in the
> future there is a change to the URLs that page previews load (or the
> beacon URLs they hit), we’d have to make a patch to the tagging logic and
> release and deploy a new refinery version to account for the change. Any
> time a new feature is added for which someone wants interactions counted,
> we have to do the same.
There doesn't seem to be much reason for the beacon URL itself to ever change.
As for new beacon endpoints (i.e. new virtual view types), why can't that just
be a whitelist that's offloaded to configuration, along the lines of the
sketch below?
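For illustration, such a whitelist could be as simple as the following Python
sketch (the config file name and the function are hypothetical):

    import json
    from urllib.parse import urlparse

    # Hypothetical config file listing the accepted beacon endpoints,
    # e.g. virtual_view_types.json: ["page-preview"]
    with open("virtual_view_types.json") as f:
        VIRTUAL_VIEW_TYPES = set(json.load(f))

    def virtual_view_type(beacon_url):
        """Return the whitelisted view type for a /beacon/ hit, else None."""
        _, sep, endpoint = urlparse(beacon_url).path.rpartition("/beacon/")
        return endpoint if sep and endpoint in VIRTUAL_VIEW_TYPES else None

Adding a new virtual view type would then be a config change rather than a
refinery code release and deployment.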