Re: [Analytics] "If it didn't happen in HDFS, it didn't happen"

10 Jun 2015

On 10 June 2015 at 10:53, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
...
  I see three ways for data to get into the cluster:

 1. request stream, handled already, we're working on ways to pump the data
 back out through APIs 
Awesome, and it'd end up in the Hadoop cluster in a table? How...do we
kick that off most easily?

...

 2. Event Logging.  We're making this scale arbitrarily by moving it to
 Kafka.  Once that's done, we should be able to instrument pretty much
 anything with Event Logging

 3. Piwik.  There is a small but growing effort to stand up our own piwik
 instance so we can get basic canned reports out of the box and not have to
 reinvent the wheel for every single feature we're trying to instrument and
 learn about.  This could replace a lot of the use cases for Event Logging
 and free up Event Logging to do more free-form research rather than cookie
 cutter web analytics.

 Answers inline:

  So I'm asking, I guess, two things. The first is: can we have a firm
 commitment that we'll get this kind of stuff into Hadoop? Right now we
 have a RESTful API everywhere that is not (to my knowledge) throwing
 data into the request logs. We have a WDQS that isn't either.
 Undoubtedly we have other tools I haven't encountered. It's paramount
 that the first question we ask with new services or systems is "so
 when does new traffic data start hitting the analytics cluster?" 

 The commitment has to be made on both sides.  The teams building the
 services have to instrument them, picking either 2 or 3 above.  And then
 we'll commit to supporting the path they choose.  The piwik path may be slow
 right now, fair warning.

  Second: what are the best practices for this? What resources are available?
 If I'm starting a service on Labs that provides data to third-parties, 

 What exactly do you mean here?  That's a loaded term and possibly against
 the labs privacy policy depending on what you mean.

An API, Dan ;)

...

 what would analytics recommend my easiest path is to getting request
 logs into Hadoop? 

 Weighing everything on balance, right now I'd say adding your name to the
 piwik supporters.  So far, off the top of my head, that list is:

 * wikimedia store
 * annual report
 * the entire reading vertical
 * russian wikimedia chapter (most likely all other chapters would chime in
 supporting it)
 * a bunch of labs projects (including wikimetrics, vital signs, various
 dashboards, etc.)

How is piwik linked to Hadoop? I'm not asking "how do we visualise the
data" I'm asking how we get it into the cluster in the first place.

...
  _______________________________________________
 Analytics mailing list
 Analytics(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics

-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
