I think this thread is a bit too vague. If piwik is woefully inadequate,
then what kind of analysis is needed for the use cases you're talking
about? It doesn't seem obvious that we need endlessly scalable systems
like Hadoop to analyze data gathered by small and fairly limited virtual
machines.
I agree with Andrew's Beta Analytics cluster idea, but I think we need to
get specific here in order to come up with a good first step.
On Wed, Jun 10, 2015 at 12:09 PM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
On 10 June 2015 at 12:00, Andrew Otto <aotto(a)wikimedia.org> wrote:
Hmmm.
There’s no reason we couldn’t maintain beta-level Kafka + Hadoop clusters in labs. We probably should! I don’t really want to maintain them myself, but they should be pretty easy to set up using hiera now. I could maintain them if no one else wants to.
Thought two:
"so when does new traffic data start hitting the analytics cluster?”
If it is HTTP requests from varnish you are looking for, this will for the most part just happen, unless the varnish cluster serving the requests is different from the usual webrequest_sources you are used to seeing. I’m not sure which varnishes RESTbase HTTP is using, but if they aren’t using one of the usual ones we are already importing into HDFS, it would be trivial to set this up.
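For illustration only, here is a minimal Python sketch of the kind of check Andrew describes: counting sample request records by their source cluster to see whether a service's traffic already lands in one of the webrequest_sources being imported into HDFS. The field names and sample values are assumptions for the sketch, not the actual webrequest schema.

```python
from collections import Counter

def requests_by_source(records):
    """Count request records per varnish source cluster.

    `records` is an iterable of dicts; the "webrequest_source" key
    is an assumed field name, standing in for the real import schema.
    """
    return Counter(r["webrequest_source"] for r in records)

# Toy records standing in for rows of the imported request stream.
sample = [
    {"webrequest_source": "text", "uri_path": "/wiki/Main_Page"},
    {"webrequest_source": "text", "uri_path": "/api/rest_v1/page/summary/Foo"},
    {"webrequest_source": "upload", "uri_path": "/wikipedia/commons/x.png"},
]

print(requests_by_source(sample))
```

If a service's requests all show up under a source that is already imported, no extra setup would be needed; a new source would mean adding it to the import configuration.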
If I'm starting a service on Labs that provides data to third parties, what would analytics recommend as my easiest path to getting request logs into Hadoop?
We can’t do this directly into the production Analytics Cluster, since labs is firewalled off from production networks. However, a service like this would be intended to move to production eventually, yes? If so, then perhaps a beta Analytics Cluster would allow you to develop the methods needed to get data into Hadoop in Labs. Then the move into production would be simpler and already have Analytics Cluster support.
That sounds better than nothing; not perfect, but totally understandable. The impression I'm really getting is "stuff should get off Labs ASAP".
2. Event Logging. We're making this scale arbitrarily by moving it to Kafka. Once that's done, we should be able to instrument pretty much anything with Event Logging.
Dan, I’d like to not promise anything here at the moment. I think this effort will significantly increase our throughput, but I’m not willing to claim arbitrary scale. Unless we figure out a way to farm out and parallelize eventlogging processors in an easy way, scaling eventlogging to big data sizes, even with Kafka, will be cumbersome and manual.
Eventually I’d like to have a system that is bound by hardware and not
architecture, but that is not well defined and still a long way off. We
will see.
But, Dan is right, eventlogging might be a good way to get labs data into the production Analytics Cluster, since any client can log via HTTP POSTs. We aren’t currently importing eventlogging data into the Analytics Cluster, but one of the points of the almost-finished eventlogging-kafka work is to get this data into Hadoop, so that should happen soon.
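As a sketch of the "any client can log via HTTP POSTs" path: the snippet below builds an EventLogging-style JSON payload that a Labs service could POST. The capsule field names, schema name, and endpoint here are assumptions for illustration; the real envelope and URL are whatever EventLogging actually expects.

```python
import json

def build_event(schema, revision, wiki, event):
    """Wrap an event dict in an EventLogging-style JSON capsule.

    The capsule fields ("schema", "revision", "wiki", "event") are
    illustrative assumptions, not the exact EventLogging format.
    """
    return json.dumps({
        "schema": schema,
        "revision": revision,
        "wiki": wiki,
        "event": event,
    })

payload = build_event("MyLabsService", 1, "labswiki",
                      {"action": "fetch", "latency_ms": 42})

# A client would then POST `payload` to the EventLogging endpoint, e.g.
#   requests.post(EVENTLOGGING_URL, data=payload)
# where EVENTLOGGING_URL is hypothetical here.
print(payload)
```

The appeal of this path is that the client side is just an HTTP request, so it works from Labs without any hole in the production firewall.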
The commitment has to be made on both sides. The teams building the
services have to instrument them,
Agree. If you want HTTP requests to your services and those HTTP requests go through varnish, this will be very easy. If you want anything beyond that, the service developers will have to implement it.
On Jun 10, 2015, at 08:35, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
On Wed, Jun 10, 2015 at 11:02 AM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
> On 10 June 2015 at 10:53, Dan Andreescu <dandreescu(a)wikimedia.org> wrote:
I see three ways for data to get into the cluster:
1. request stream, handled already, we're working on ways to pump the data back out through APIs
Awesome, and it'd end up in the Hadoop cluster in a table? How...do we
kick that off most easily?
Nono, I mean our specific web request stream. I don't think there's any way to piggyback onto that for arbitrary other services. This is not an option for you, it's just a way that data gets into the cluster, for completeness.
> >> Second: what's best practices for this? What resources are available?
> >> If I'm starting a service on Labs that provides data to third parties,
What exactly do you mean here? That's a loaded term and possibly against the labs privacy policy depending on what you mean.
An API, Dan ;)
Ok, so ... if usage of the API is what you're after, I think piwik is probably the best solution.
> >> what would analytics recommend as my easiest path to getting request
> >> logs into Hadoop?
> > Weighing everything on balance, right now I'd say adding your name to
> > the piwik supporters. So far, off the top of my head, that list is:
> >
> > * wikimedia store
> > * annual report
> > * the entire reading vertical
> > * russian wikimedia chapter (most likely all other chapters would chime
> > in supporting it)
> > * a bunch of labs projects (including wikimetrics, vital signs, various
> > dashboards, etc.)
How is piwik linked to Hadoop? I'm not asking "how do we visualise the
data" I'm asking how we get it into the cluster in the first place.
I think for the most part, piwik would handle reporting and crunching numbers for you and get you some basic reports. But if we wanted to crunch tons of data, we could integrate it with hadoop somehow.
I'm kind of challenging IIDNHIHIDNH (If it did not happen in HDFS it did not happen).
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Oliver Keyes
Research Analyst
Wikimedia Foundation