Hmmm.
There’s no reason we couldn’t maintain beta-level Kafka + Hadoop clusters in labs. We
probably should! I don’t really want to maintain them myself, but they should be pretty
easy to set up using hiera now. I could maintain them if no one else wants to.
Thought two:
“So when does new traffic data start hitting the analytics cluster?”
If it is HTTP
requests from varnish you are looking for, this will for the most part just happen, unless
the varnish cluster serving the requests is different from the usual webrequest_sources
you are used to seeing. I’m not sure which varnishes RESTbase HTTP is using, but if they
aren’t using one of the usual ones we are already importing into HDFS, it would be trivial
to set this up.
If I'm starting a service on Labs that provides
data to third-parties,
what would analytics recommend as my easiest path to getting request
logs into Hadoop?
We can’t do this directly into the production Analytics Cluster, since labs is firewalled
off from production networks. However, a service like this would be intended to move to
production eventually, yes? If so, then perhaps a beta Analytics Cluster would allow you
to develop the methods needed to get data into Hadoop in Labs. Then the move into
production would be simpler and already have Analytics Cluster support.
2. Event Logging. We're making this scale
arbitrarily by moving it to Kafka. Once that's done, we should be able to instrument
pretty much anything with Event Logging.
Dan, I’d like to not promise anything here
at the moment. I think this effort will significantly increase our throughput, but I’m
not willing to claim arbitrary scale. Unless we figure out a way to farm out and
parallelize eventlogging processors in an easy way, scaling eventlogging even with Kafka
to big data sizes will be cumbersome and manual.
Eventually I’d like to have a system that is bound by hardware and not architecture, but
that is not well defined and still a long way off. We will see.
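To make the “farm out and parallelize” idea concrete, here is a minimal sketch in plain Python (standard library only). The partition count and processor logic are stand-ins I made up for illustration; a real deployment would consume partitions from Kafka rather than an in-memory list of events.

```python
# Sketch: parallelizing eventlogging-style processors by partition.
# All names here (NUM_PARTITIONS, process_partition) are illustrative,
# not part of any real eventlogging codebase.
import concurrent.futures

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(event):
    # Route an event to a partition by hashing its schema name, so all
    # events of one schema are handled in order by a single worker.
    return hash(event["schema"]) % NUM_PARTITIONS


def process_partition(events):
    # Stand-in for a processor: validate events and return how many passed.
    valid = [e for e in events if "schema" in e and "event" in e]
    return len(valid)


def run(events):
    # Split the stream into partitions, then process each one concurrently.
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for e in events:
        partitions[partition_for(e)].append(e)
    with concurrent.futures.ThreadPoolExecutor(NUM_PARTITIONS) as pool:
        counts = list(pool.map(process_partition, partitions))
    return sum(counts)
```

The point of the sketch is only that once events are keyed into independent partitions, adding throughput means adding workers, which is the kind of hardware-bound scaling described above.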
But, Dan is right, eventlogging might be a good way to get labs data into the production
Analytics Cluster, since any client can log via HTTP POSTs. We aren’t currently importing
eventlogging data into the Analytics Cluster, but one of the points of the almost-finished
eventlogging-kafka work is to get this data into Hadoop, so that should happen soon.
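As a sketch of what “log via HTTP POSTs” could look like from a Labs client: the endpoint URL, capsule fields, and schema name below are hypothetical placeholders, not the real eventlogging configuration.

```python
# Sketch: a Labs service emitting events over HTTP POST.
# EVENTLOGGING_URL and the capsule layout are assumptions for
# illustration only.
import json
import urllib.request

EVENTLOGGING_URL = "https://eventlogging.example.org/event"  # placeholder


def build_request(schema, revision, event):
    # Wrap the event in an eventlogging-style capsule and prepare an
    # HTTP POST with a JSON body.
    capsule = {"schema": schema, "revision": revision, "event": event}
    body = json.dumps(capsule).encode("utf-8")
    return urllib.request.Request(
        EVENTLOGGING_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(schema, revision, event):
    # Perform the POST; a real client would add error handling and retries.
    with urllib.request.urlopen(build_request(schema, revision, event)) as resp:
        return resp.status
```

Since the transport is just HTTP, this works equally well from labs or production, which is what makes it attractive as a path into the cluster.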
The commitment has to be made on both sides. The
teams building the services have to instrument them,
Agree. If you want HTTP
requests to your services and those HTTP requests go through varnish, this will be very
easy. If you want anything beyond that, the service developers will have to implement
it.
On Jun 10, 2015, at 08:35, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
On Wed, Jun 10, 2015 at 11:02 AM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
On 10 June 2015 at 10:53, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
I see three ways for data to get into the
cluster:
1. request stream, handled already, we're working on ways to pump the data
back out through APIs
Awesome, and it'd end up in the Hadoop cluster in a table? How...do we
kick that off most easily?
Nono, I mean our specific web request stream. I don't think there's any way to
piggyback onto that for arbitrary other services. This is not an option for you, it's
just a way that data gets into the cluster, for completeness.
Second: If I'm starting a service on Labs that provides data to third-parties,
what's best practices for this? What resources are available?
What exactly do you mean here? That's a loaded term and possibly against
the labs privacy policy depending on what you mean.
An API, Dan ;)
Ok, so ... usage of the API is what you're after. I think piwik is probably the best
solution.
what would analytics recommend as my easiest path to getting request
logs into Hadoop?
Weighing everything on balance, right now I'd say adding your name to the
piwik supporters. So far, off the top of my head, that list is:
* wikimedia store
* annual report
* the entire reading vertical
* russian wikimedia chapter (most likely all other chapters would chime in
supporting it)
* a bunch of labs projects (including wikimetrics, vital signs, various
dashboards, etc.)
How is piwik linked to Hadoop? I'm not asking "how do we visualise the
data" I'm asking how we get it into the cluster in the first place.
I think for the most part, piwik would handle reporting and crunching numbers for you and
get you some basic reports. But if we wanted to crunch tons of data, we could integrate
it with Hadoop somehow.
I'm kind of challenging IIDNHIHIDNH (If it did not happen in HDFS it did not
happen).
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics