Hey all,
We're building a lot of tools out on Labs. From a RESTful API to a Wikidata Query Service, we're making neat things, and Labs is proving the perfect place to prototype them - in all but one respect.
A crucial part of these tools being not just useful but measurably useful is having their logs available to parse. Combine that with the constraints on getting things onto the main cluster, and what you have is a situation where much of our beta or alpha software has no integration with our existing data storage systems but absolutely /needs/ it, both to verify that it's worth keeping and to provide data about usage.
So I'm asking, I guess, two things. The first is: can we have a firm commitment that we'll get this kind of stuff into Hadoop? Right now we have a RESTful API everywhere that is not (to my knowledge) throwing data into the request logs. We have a WDQS that isn't either. Undoubtedly we have other tools I haven't encountered. It's paramount that the first question we ask with new services or systems is "so when does new traffic data start hitting the analytics cluster?"
Second: what are best practices for this? What resources are available? If I'm starting a service on Labs that provides data to third-parties, what would analytics recommend as my easiest path to getting request logs into Hadoop?
I see three ways for data to get into the cluster:
1. request stream, handled already, we're working on ways to pump the data back out through APIs
2. Event Logging. We're making this scale arbitrarily by moving it to Kafka. Once that's done, we should be able to instrument pretty much anything with Event Logging
3. Piwik. There is a small but growing effort to stand up our own Piwik instance so we can get basic canned reports out of the box and not have to reinvent the wheel for every single feature we're trying to instrument and learn about. This could replace a lot of the use cases for Event Logging and free up Event Logging to do more free-form research rather than cookie-cutter web analytics. (A rough sketch of what that instrumentation could look like follows this list.)
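To make option 3 concrete: the instrumentation would not have to be client-side JavaScript. A Labs-hosted service could record hits server-side against a piwik instance via piwik's HTTP Tracking API. A minimal sketch, assuming a hypothetical piwik host and site id (the URL, site id, and helper function below are placeholders, not anything that exists today):

import requests

PIWIK_URL = "https://piwik.example.wmflabs.org/piwik.php"  # placeholder instance
SITE_ID = 1  # placeholder site id registered in that piwik instance

def track_request(requested_url, action_name, user_agent=None):
    """Record one hit in piwik for a request this service just served."""
    params = {
        "idsite": SITE_ID,     # which piwik site the hit belongs to
        "rec": 1,              # required by the Tracking API to actually record the hit
        "url": requested_url,  # the URL that was requested from our service
        "action_name": action_name,
        "apiv": 1,             # Tracking API version
    }
    headers = {"User-Agent": user_agent} if user_agent else {}
    # Fire-and-forget; a real deployment would want sane timeouts and retries.
    requests.get(PIWIK_URL, params=params, headers=headers, timeout=2)

# e.g. from a request handler, after serving /api/v1/search?q=foo:
# track_request("https://myservice.wmflabs.org/api/v1/search?q=foo", "api.search")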
Answers inline:
So I'm asking, I guess, two things. The first is: can we have a firm commitment that we'll get this kind of stuff into Hadoop? Right now we have a RESTful API everywhere that is not (to my knowledge) throwing data into the request logs. We have a WDQS that isn't either. Undoubtedly we have other tools I haven't encountered. It's paramount that the first question we ask with new services or systems is "so when does new traffic data start hitting the analytics cluster?"
The commitment has to be made on both sides. The teams building the services have to instrument them, picking either 2 or 3 above. And then we'll commit to supporting the path they choose. The piwik path may be slow right now, fair warning.
Second: what are best practices for this? What resources are available? If I'm starting a service on Labs that provides data to third-parties,
What exactly do you mean here? That's a loaded term and possibly against the labs privacy policy depending on what you mean.
what would analytics recommend as my easiest path to getting request logs into Hadoop?
On balance, right now I'd say adding your name to the piwik supporters. So far, off the top of my head, that list is:
* wikimedia store
* annual report
* the entire reading vertical
* russian wikimedia chapter (most likely all other chapters would chime in supporting it)
* a bunch of labs projects (including wikimetrics, vital signs, various dashboards, etc.)
On 10 June 2015 at 10:53, Dan Andreescu dandreescu@wikimedia.org wrote:
I see three ways for data to get into the cluster:
- request stream, handled already, we're working on ways to pump the data back out through APIs
Awesome, and it'd end up in the Hadoop cluster in a table? How...do we kick that off most easily?
- Event Logging. We're making this scale arbitrarily by moving it to Kafka. Once that's done, we should be able to instrument pretty much anything with Event Logging
- Piwik. There is a small but growing effort to stand up our own piwik instance so we can get basic canned reports out of the box and not have to reinvent the wheel for every single feature we're trying to instrument and learn about. This could replace a lot of the use cases for Event Logging and free up Event Logging to do more free-form research rather than cookie cutter web analytics.
Answers inline:
So I'm asking, I guess, two things. The first is: can we have a firm commitment that we'll get this kind of stuff into Hadoop? Right now we have a RESTful API everywhere that is not (to my knowledge) throwing data into the request logs. We have a WDQS that isn't either. Undoubtedly we have other tools I haven't encountered. It's paramount that the first question we ask with new services or systems is "so when does new traffic data start hitting the analytics cluster?"
The commitment has to be made on both sides. The teams building the services have to instrument them, picking either 2 or 3 above. And then we'll commit to supporting the path they choose. The piwik path may be slow right now, fair warning.
Second: what are best practices for this? What resources are available? If I'm starting a service on Labs that provides data to third-parties,
What exactly do you mean here? That's a loaded term and possibly against the labs privacy policy depending on what you mean.
An API, Dan ;)
what would analytics recommend as my easiest path to getting request logs into Hadoop?
On balance, right now I'd say adding your name to the piwik supporters. So far, off the top of my head, that list is:
- wikimedia store
- annual report
- the entire reading vertical
- russian wikimedia chapter (most likely all other chapters would chime in supporting it)
- a bunch of labs projects (including wikimetrics, vital signs, various dashboards, etc.)
How is piwik linked to Hadoop? I'm not asking "how do we visualise the data"; I'm asking how we get it into the cluster in the first place.
On Wed, Jun 10, 2015 at 11:02 AM, Oliver Keyes okeyes@wikimedia.org wrote:
On 10 June 2015 at 10:53, Dan Andreescu dandreescu@wikimedia.org wrote:
I see three ways for data to get into the cluster:
- request stream, handled already, we're working on ways to pump the data back out through APIs
Awesome, and it'd end up in the Hadoop cluster in a table? How...do we kick that off most easily?
Nono, I mean our specific web request stream. I don't think there's any way to piggyback onto that for arbitrary other services. This is not an option for you, it's just a way that data gets into the cluster, for completeness.
Second: what are best practices for this? What resources are available? If I'm starting a service on Labs that provides data to third-parties,
What exactly do you mean here? That's a loaded term and possibly against the labs privacy policy depending on what you mean.
An API, Dan ;)
Ok, so ... usage of the API is what you're after; I think piwik is probably the best solution.
what would analytics recommend as my easiest path to getting request logs into Hadoop?
On balance, right now I'd say adding your name to the piwik supporters. So far, off the top of my head, that list is:
- wikimedia store
- annual report
- the entire reading vertical
- russian wikimedia chapter (most likely all other chapters would chime in supporting it)
- a bunch of labs projects (including wikimetrics, vital signs, various dashboards, etc.)
How is piwik linked to Hadoop? I'm not asking "how do we visualise the data"; I'm asking how we get it into the cluster in the first place.
I think for the most part, piwik would handle reporting and crunching numbers for you and get you some basic reports. But if we wanted to crunch tons of data, we could integrate it with hadoop somehow.
I'm kind of challenging IIDNHIHIDNH (If it did not happen in HDFS it did not happen).
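To sketch how that "somehow" might look (purely an assumption on my part: the piwik host, auth token, and HDFS path below are placeholders, and Live.getLastVisitsDetails is the piwik Reporting API method as I understand it), a periodic job could pull the raw visit log out of piwik and land it in HDFS:

import json
import subprocess
import requests

PIWIK = "https://piwik.example.wmflabs.org/index.php"  # placeholder instance
TOKEN = "placeholder_token_auth"                       # placeholder API token

def export_day_to_hdfs(site_id, date, hdfs_dir):
    """Fetch one day of raw piwik visits and write them into HDFS as JSON lines."""
    resp = requests.get(PIWIK, params={
        "module": "API",
        "method": "Live.getLastVisitsDetails",  # raw visit log, one record per visit
        "idSite": site_id,
        "period": "day",
        "date": date,
        "format": "JSON",
        "filter_limit": -1,                     # all visits for that day
        "token_auth": TOKEN,
    }, timeout=60)
    visits = resp.json()
    ndjson = "\n".join(json.dumps(v) for v in visits)
    # 'hdfs dfs -put -' reads from stdin, so we can stream straight into HDFS.
    subprocess.run(["hdfs", "dfs", "-put", "-", "%s/%s.json" % (hdfs_dir, date)],
                   input=ndjson.encode("utf-8"), check=True)

# export_day_to_hdfs(1, "2015-06-10", "/wmf/data/raw/piwik/site_1")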
HmMmm.
There’s no reason we couldn’t maintain beta-level Kafka + Hadoop clusters in labs. We probably should! I don’t really want to maintain them myself, but they should be pretty easy to set up using hiera now. I could maintain them if no one else wants to.
Thought two:
"so when does new traffic data start hitting the analytics cluster?”
If it is HTTP requests from varnish you are looking for, this will for the most part just happen, unless the varnish cluster serving the requests is different than the usual webrequest_sources you are used to seeing. I’m not sure which varnishes RESTbase HTTP is using, but if they aren’t using one of the usual ones we are already importing into HDFS, it would be trivial to set this up.
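If that is the path, then once the service's varnish cluster is flowing into HDFS, its traffic should be queryable from the refined webrequest table in Hive. A sketch of what checking that might look like, with the caveat that the table and partition names (wmf.webrequest, webrequest_source, hourly partitions) reflect my understanding of the current layout, and the source and uri_host values are placeholders:

import subprocess

QUERY = """
SELECT uri_path, COUNT(*) AS hits
FROM wmf.webrequest
WHERE webrequest_source = 'misc'          -- whichever source that varnish cluster maps to
  AND year = 2015 AND month = 6 AND day = 10
  AND uri_host = 'myservice.example.org'  -- placeholder host for the new service
GROUP BY uri_path
ORDER BY hits DESC
LIMIT 20
"""

# Run from a host that has the Hive CLI and access to the Analytics cluster.
subprocess.run(["hive", "-e", QUERY], check=True)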
If I'm starting a service on Labs that provides data to third-parties, what would analytics recommend as my easiest path to getting request logs into Hadoop?
We can’t do this directly into the production Analytics Cluster, since labs is firewalled off from production networks. However, a service like this would be intended to move to production eventually, yes? If so, then perhaps a beta Analytics Cluster would allow you to develop the methods needed to get data into Hadoop in Labs. Then the move into production would be simpler and already have Analytics Cluster support.
- Event Logging. We're making this scale arbitrarily by moving it to Kafka. Once that's done, we should be able to instrument pretty much anything with Event Logging
Dan, I’d like to not promise anything here at the moment. I think this effort will significantly increase our throughput, but I’m not willing to claim arbitrary scale. Unless we figure out a way to farm out and parallelize eventlogging processors in an easy way, scaling eventlogging even with Kafka to big data sizes will be cumbersome and manual.
Eventually I’d like to have a system that is bound by hardware and not architecture, but that is not well defined and still a long way off. We will see.
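Just to illustrate the kind of farming out I mean (this is not how eventlogging is structured today; the broker list and topic name are placeholders, and it uses the kafka-python client purely as an example): several processor workers could join the same Kafka consumer group, so the topic's partitions get split between them and throughput scales with the number of workers.

from multiprocessing import Process

from kafka import KafkaConsumer  # kafka-python, used here only for illustration

BROKERS = ["kafka1001.example.org:9092"]  # placeholder broker list
TOPIC = "eventlogging-client-side"        # placeholder raw-events topic

def run_processor(worker_id):
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        group_id="eventlogging-processors",  # shared group => partitions are divided up
        value_deserializer=lambda raw: raw.decode("utf-8", "replace"),
    )
    for message in consumer:
        raw_event = message.value
        # ... parse, validate against its schema, and re-produce to a
        # "valid events" topic here ...
        print("worker %d handled an event from partition %d"
              % (worker_id, message.partition))

if __name__ == "__main__":
    workers = [Process(target=run_processor, args=(i,)) for i in range(4)]
    for w in workers:
        w.start()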
But, Dan is right, eventlogging might be a good way to get labs data into the production Analytics Cluster, since any client can log via HTTP POSTs. We aren’t currently importing eventlogging data into the Analytics Cluster, but one of the points of the almost-finished eventlogging-kafka is to get this data into Hadoop, so that should happen soon.
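For what "any client can log via HTTP POSTs" could look like from a Labs service, here is a hedged sketch. The endpoint URL, schema name, revision, and event fields are all placeholders; check the EventLogging documentation for the real intake URL and the exact capsule format before relying on this.

import json
import requests

EVENTLOGGING_ENDPOINT = "https://eventlogging.example.org/event"  # placeholder URL

def log_event(schema, revision, event, wiki="metawiki"):
    """Send one EventLogging-style capsule describing something this service did."""
    capsule = {
        "schema": schema,      # name of the schema the event claims to follow
        "revision": revision,  # schema revision the event was written against
        "wiki": wiki,
        "event": event,        # the schema-specific payload
    }
    requests.post(EVENTLOGGING_ENDPOINT,
                  data=json.dumps(capsule),
                  headers={"Content-Type": "application/json"},
                  timeout=2)

# e.g. record one API request served by a hypothetical Labs service:
# log_event("LabsApiRequest", 1, {"endpoint": "/api/v1/search", "httpStatus": 200})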
The commitment has to be made on both sides. The teams building the services have to instrument them,
Agree. If you want HTTP requests to your services and those HTTP requests go through varnish, this will be very easy. If you want anything beyond that, the service developers will have to implement it.
On 10 June 2015 at 12:00, Andrew Otto aotto@wikimedia.org wrote:
HmMmm.
There’s no reason we couldn’t maintain beta-level Kafka + Hadoop clusters in labs. We probably should! I don’t really want to maintain them myself, but they should be pretty easy to set up using hiera now. I could maintain them if no one else wants to.
Thought two:
"so when does new traffic data start hitting the analytics cluster?”
If it is HTTP requests from varnish you are looking for, this will for the most part just happen, unless the varnish cluster serving the requests is different than the usual webrequest_sources you are used to seeing. I’m not sure which varnishes RESTbase HTTP is using, but if they aren’t using one of the usual ones we are already importing into HDFS, it would be trivial to set this up.
If I'm starting a service on Labs that provides data to third-parties, what would analytics recommend as my easiest path to getting request logs into Hadoop?
We can’t do this directly into the production Analytics Cluster, since labs is firewalled off from production networks. However, a service like this would be intended to move to production eventually, yes? If so, then perhaps a beta Analytics Cluster would allow you to develop the methods needed to get data into Hadoop in Labs. Then the move into production would be simpler and already have Analytics Cluster support.
That sounds better than nothing; not perfect, but totally understandable. The impression I'm really getting is "stuff should get off Labs ASAP"
I think this thread is a bit too vague. If piwik is woefully inadequate, then what kind of analysis is needed for the use cases you're talking about? It doesn't seem obvious that we need endlessly scalable systems like Hadoop to analyze data gathered by small and fairly limited virtual machines.
I agree with Andrew's Beta Analytics cluster idea, but I think we need to get specific here in order to come up with a good first step.
At the moment I don't have specific questions because we're trying to just get the thing set up. But, wider context and a prediction:
The budget this year has ensured, at least for Discovery, that ops and hardware support are slashed to the bone. Because of this we're deploying bigger and bigger things on Labs - I wouldn't describe Wikidata Query Service as a "small and fairly limited virtual machine" - because there we actually have machines (sure, virtual ones, but machines; there is hardware). This isn't going to stop until people have the resourcing to throw them out on production. So from where I'm sitting it looks like the options are "no integration around analytics" or "stop building anything until you have machines in prod for it".
I don't want to give AnEng a ton of work, but neither of these options seems particularly appealing, especially since I have a mandate to /get/ analytics for those things we're building. And not having a cluster on labs, or cluster access from labs, doesn't remove the headache; it just shifts it downstream, because now every analyst generating metrics from these services has to integrate an entirely new set of things into their workflows.
Question about "the budget this year has ensured, at least for Discovery, that ops and hardware support are slashed to the bone." I'm trying to figure out the paradox of hiring more people for Discovery at the same time that ops and hardware support are reduced. Can someone explain?
Thanks, Pine
Probably, on the Discovery team mailing list.
Update on this:
* Piwik is not finding a lot of love. The readership team is working on puppetizing it and we theoretically have hardware to run it, but we haven't decided it's a good idea for Analytics to support this yet.
* We're a (bit?) more optimistic about parallel Event Logging processors. Last we spoke Madhu was going to try and modify the eventlogging_processor code to allow this.
In short, the best bet for getting data into HDFS right now might be to make an EL schema and wait for us to move it to Kafka transport.
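For reference, "make an EL schema" means writing a JSON Schema document on Meta-Wiki. Here is roughly what a minimal one for API request logging could look like, written as a Python dict only to match the other sketches in this thread; the schema name and its fields are invented for illustration.

LABS_API_REQUEST_SCHEMA = {
    "description": "A request served by a Labs-hosted API (hypothetical schema)",
    "properties": {
        "endpoint":   {"type": "string",  "required": True,
                       "description": "Path of the API endpoint that was hit"},
        "httpStatus": {"type": "integer", "required": True,
                       "description": "HTTP status code returned to the client"},
        "responseMs": {"type": "integer", "required": False,
                       "description": "Time taken to serve the request, in milliseconds"},
        "userAgent":  {"type": "string",  "required": False,
                       "description": "User agent of the requesting client"},
    },
}

Events conforming to it would then be posted to the EventLogging endpoint and, once the Kafka transport lands, should flow into Hadoop with the rest of the eventlogging data.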
Gotcha. And we can put EL on labs?
Theoretically we should be able to request the Event Logging endpoint URI from anywhere. But I don't know how CORS is set up on that endpoint after this recent change.
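One quick way to answer the CORS question empirically (the endpoint URL and Labs origin below are placeholders): issue a preflight-style OPTIONS request the way a browser on a Labs-hosted page would, and look at what comes back.

import requests

ENDPOINT = "https://eventlogging.example.org/event"  # placeholder EL endpoint
LABS_ORIGIN = "https://myservice.wmflabs.org"        # placeholder Labs origin

resp = requests.options(ENDPOINT, headers={
    "Origin": LABS_ORIGIN,
    "Access-Control-Request-Method": "POST",
    "Access-Control-Request-Headers": "Content-Type",
}, timeout=5)

print(resp.status_code)
print(resp.headers.get("Access-Control-Allow-Origin"))   # '*' or our origin => allowed
print(resp.headers.get("Access-Control-Allow-Methods"))  # should include POST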
On 10 June 2015 at 11:35, Dan Andreescu dandreescu@wikimedia.org wrote:
On Wed, Jun 10, 2015 at 11:02 AM, Oliver Keyes okeyes@wikimedia.org wrote:
On 10 June 2015 at 10:53, Dan Andreescu dandreescu@wikimedia.org wrote:
I see three ways for data to get into the cluster:
- request stream, handled already, we're working on ways to pump the data back out through APIs
Awesome, and it'd end up in the Hadoop cluster in a table? How...do we kick that off most easily?
Nono, I mean our specific web request stream. I don't think there's any way to piggyback onto that for arbitrary other services. This is not an option for you, it's just a way that data gets into the cluster, for completeness.
Second: what are best practices for this? What resources are available? If I'm starting a service on Labs that provides data to third-parties,
What exactly do you mean here? That's a loaded term and possibly against the labs privacy policy depending on what you mean.
An API, Dan ;)
Ok, so ... usage of the API is what you're after; I think piwik is probably the best solution.
It's not. I've used Piwik before many times and it's not what we're looking for. My question is "how do I get the request logs into HDFS?" Your answer is a piece of software that, last time I checked, requires JS executed on the client machine and will put the data in yet another service that can't be tightly integrated with our dashboards in the same way.
We should have a way of doing this. If there is genuinely no way of getting requests from labs varnish instances into HDFS, we need to either (a) develop that or (b) stop using Labs for any kind of beta release.