the beacon puts the record into the webrequest table and from there it would only take some trivial preprocessing
‘Trivial’ preprocessing that has to look through 150K requests per second!
This is a lot of work!
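To put a rough number on that, here is a back-of-the-envelope sketch in Python. It assumes the 150K requests per second figure holds steady across the whole day, which is a simplification, and the variable names are only for illustration.

```python
# Rough daily volume implied by the request rate quoted above.
# Assumption: the rate is constant over 24 hours (real traffic varies).
REQUESTS_PER_SECOND = 150_000
SECONDS_PER_DAY = 24 * 60 * 60

requests_per_day = REQUESTS_PER_SECOND * SECONDS_PER_DAY
print(f"{requests_per_day:,} requests per day")  # 12,960,000,000
```

Under that assumption, any job that finds one kind of record by scanning webrequests has to look at roughly 13 billion rows per day.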
tracking of events is better done on an event based system and EL is such
I agree with this too. We really want to discourage people from trying to
measure things by searching through the huge haystack of all webrequests.
To measure something, you should emit an event if you can. If it were
practical, I’d prefer that we did this for pageviews as well. Currently,
we need a complicated definition of what a pageview is, which really only
exists in the Java implementation in the Hadoop cluster. It’d be much
clearer if app developers had a way to define themselves what counts as a
pageview, and emit that as an event.
This should be the approach that people take when they want to measure
something new. Emit an event! This event will get its own Kafka topic
(you can consume this to do whatever you like with it), and be refined into
its own Hive table.
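As a minimal sketch of what "emit an event" could look like from the producer side, here is some Python that wraps a few event fields in an envelope and serializes it to JSON. Everything in it is an assumption made for illustration: the schema name, the field names, the envelope layout and the in-memory `sink` list (which stands in for whatever actually ships the event towards Kafka) are hypothetical and are not the real EventLogging API.

```python
import json
import time
import uuid


def build_event(schema, revision, event, wiki="enwiki"):
    """Wrap schema-specific fields in a small envelope (hypothetical layout)."""
    return {
        "schema": schema,        # hypothetical schema name
        "revision": revision,    # hypothetical schema revision
        "wiki": wiki,
        "uuid": uuid.uuid4().hex,
        "dt": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "event": event,          # the fields the app developer defines
    }


def emit(event, sink):
    """Serialize the event and hand it to a sink (stand-in for the real transport)."""
    line = json.dumps(event, sort_keys=True)
    sink.append(line)
    return line


# The app decides for itself what counts as a pageview and says so explicitly.
sink = []
pageview = build_event(
    "HypotheticalPageview",
    1,
    {"page_title": "Main_Page", "is_preview": False},
)
emit(pageview, sink)
```

The point of the sketch is that the definition of the thing being measured lives with the code that emits it, so nothing downstream has to reverse-engineer it out of the full webrequest stream.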
I don’t want to have to create that chart and export one dataset from pageviews and one dataset from eventlogging to do that.
If you also design your schema nicely,
it will be easily importable into Druid and usable in Pivot and Superset,
alongside pageviews. We’re working on getting nice schemas automatically
imported into Druid <https://gerrit.wikimedia.org/r/#/c/386882/>.
On Thu, Jan 18, 2018 at 11:16 AM, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
while EventLogging data gets stored in a different, unrelated way
Not really. This has changed quite a bit as of the last
EventLogging data is now preprocessed and refined similarly to how
webrequest data is preprocessed and refined. You can have a dashboard on
top of some EventLogging schemas in Superset in the same way you have a
dashboard that displays pageview data in Superset.
See the dashboards on Superset (account required).
And (again, account required) the EventLogging data in Druid, the very same
data we are talking about, page previews:
I was going to make the point that #2 already has a processing pipeline established whereas #1 doesn't.
This is incorrect: we mark as "preview" data that we want to exclude from
The naming is unfortunate, but previews are really "preloads", as in requests
we make (and cache locally) that may or may not be shown to users.
But again, tracking of events is better done on an event based system and
EL is such a system.
Analytics mailing list