Re: [Analytics] "If it didn't happen in HDFS, it didn't happen"

10 Jun 2015

On 10 June 2015 at 11:35, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
...

 On Wed, Jun 10, 2015 at 11:02 AM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:

 On 10 June 2015 at 10:53, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
  I see three ways for data to get into the cluster:

 1. request stream, handled already, we're working on ways to pump the data
 back out through APIs
 Awesome, and it'd end up in the Hadoop cluster in a table? How...do we
 kick that off most easily? 

 Nono, I mean our specific web request stream.  I don't think there's any way
 to piggyback onto that for arbitrary other services.  This is not an option
 for you, it's just a way that data gets into the cluster, for completeness.

 Second: what are the best practices for this? What resources are available?
 If I'm starting a service on Labs that provides data to third-parties, 

 What exactly do you mean here?  That's a loaded term and possibly against
 the labs privacy policy depending on what you mean.

 An API, Dan ;) 

 Ok, so ... usage of the API is what you're after, I think piwik is probably
 the best solution.

It's not. I've used Piwik before many times and it's not what we're
looking for. My question is "how do I get the request logs into HDFS?"
Your answer is a piece of software that, last time I checked, required
JS executed on the client machine and will put the data in yet another
service that can't be tightly integrated with our dashboards in the
same way.

We should have a way of doing this. If there is genuinely no way of
getting requests from labs varnish instances into HDFS, we need to
either (a) develop that or (b) stop using Labs for any kind of beta
release.

...

 what would analytics recommend my easiest path is to getting request
 logs into Hadoop? 

 Weighing everything on balance, right now I'd say adding your name to the
 piwik supporters.  So far, off the top of my head, that list is:

 * wikimedia store
 * annual report
 * the entire reading vertical
 * russian wikimedia chapter (most likely all other chapters would chime in
 supporting it)
 * a bunch of labs projects (including wikimetrics, vital signs, various
 dashboards, etc.)

 How is piwik linked to Hadoop? I'm not asking "how do we visualise the
 data" I'm asking how we get it into the cluster in the first place. 

 I think for the most part, piwik would handle reporting and crunching
 numbers for you and get you some basic reports.  But if we wanted to crunch
 tons of data, we could integrate it with hadoop somehow.

 I'm kind of challenging IIDNHIHIDNH (If it did not happen in HDFS it did not
 happen).

 _______________________________________________
 Analytics mailing list
 Analytics(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics

-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
