On Wed, Jun 10, 2015 at 11:02 AM, Oliver Keyes okeyes@wikimedia.org wrote:
On 10 June 2015 at 10:53, Dan Andreescu dandreescu@wikimedia.org wrote:
I see three ways for data to get into the cluster:
- request stream, handled already, we're working on ways to pump the data back out through APIs
Awesome, and it'd end up in the Hadoop cluster in a table? How... do we kick that off most easily?
No no, I mean our specific web request stream. I don't think there's any way to piggyback on that for arbitrary other services. This is not an option for you; it's just one way that data gets into the cluster, listed for completeness.
Second: what are best practices for this? What resources are available?
If I'm starting a service on Labs that provides data to third parties,
What exactly do you mean here? That's a loaded term and possibly against the labs privacy policy depending on what you mean.
An API, Dan ;)
Ok, so ... usage of the API is what you're after. I think piwik is probably the best solution.
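For concreteness, here's a rough sketch of what the service side of that might look like, i.e. firing a hit at piwik's HTTP tracking endpoint for each API request. The piwik host, site id, and service URL below are made-up placeholders:

import random
import requests

PIWIK_URL = "https://piwik.wmflabs.org/piwik.php"     # hypothetical piwik install
PIWIK_SITE_ID = 1                                     # hypothetical site id
SERVICE_BASE = "https://my-labs-service.wmflabs.org"  # hypothetical Labs service

def track_api_hit(request_path, user_agent=None):
    """Fire-and-forget tracking call for one API request."""
    params = {
        "idsite": PIWIK_SITE_ID,
        "rec": 1,                          # required: actually record the hit
        "url": SERVICE_BASE + request_path,
        "action_name": "API" + request_path,
        "apiv": 1,
        "rand": random.randint(0, 2**31),  # cache buster
    }
    headers = {"User-Agent": user_agent} if user_agent else {}
    try:
        requests.get(PIWIK_URL, params=params, headers=headers, timeout=2)
    except requests.RequestException:
        pass  # analytics failures should never break the service itself

You'd call track_api_hit() from whatever request handler the service already has; piwik then takes care of the storage and the basic reports.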
what would analytics recommend as my easiest path to getting request logs into Hadoop?
On balance, right now I'd say adding your name to the list of piwik supporters. So far, off the top of my head, that list is:
- wikimedia store
- annual report
- the entire reading vertical
- russian wikimedia chapter (most likely all other chapters would chime in supporting it)
- a bunch of labs projects (including wikimetrics, vital signs, various dashboards, etc.)
How is piwik linked to Hadoop? I'm not asking "how do we visualise the data"; I'm asking how we get it into the cluster in the first place.
I think, for the most part, piwik would handle the number crunching for you and get you some basic reports. But if we wanted to crunch tons of data, we could integrate it with Hadoop somehow.
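To make "somehow" slightly more concrete: one option would be a scheduled job that pulls reports back out of piwik's HTTP Reporting API and drops them into HDFS, where Hive jobs could pick them up. A rough sketch (piwik host, site id, token, and HDFS path are all placeholders):

import json
import subprocess
import tempfile

import requests

PIWIK_API = "https://piwik.wmflabs.org/index.php"  # hypothetical piwik install
PARAMS = {
    "module": "API",
    "method": "Actions.getPageUrls",   # which URLs got hit yesterday
    "idSite": 1,                       # hypothetical site id
    "period": "day",
    "date": "yesterday",
    "format": "JSON",
    "token_auth": "REPLACE_ME",        # piwik API auth token
}

def export_to_hdfs(hdfs_dir="/wmf/data/raw/my_labs_service"):  # hypothetical path
    report = requests.get(PIWIK_API, params=PARAMS, timeout=30).json()
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(report, f)
        local_path = f.name
    # push the local dump into the cluster; assumes the hdfs CLI is on the box
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir])
    subprocess.check_call(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir])

if __name__ == "__main__":
    export_to_hdfs()

That wouldn't be the raw request stream, but it would at least get the numbers into HDFS so they can be joined with everything else that lives there.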
I'm kind of challenging IIDNHIHIDNH (If it did not happen in HDFS it did not happen).