Hmmmm.
There’s no reason we couldn’t maintain beta-level Kafka + Hadoop clusters in labs. We probably should! I don’t really want to maintain them myself, but they should be pretty easy to set up using hiera now. I could maintain them if no one else wants to.
Thought two:
"so when does new traffic data start hitting the analytics cluster?"
If it is HTTP requests from varnish you are looking for, this will for the most part just happen, unless the varnish cluster serving the requests is different than the usual webrequest_sources you are used to seeing. I’m not sure which varnishes RESTbase HTTP is using, but if they aren’t using one of the usual ones we are already importing into HDFS, it would be trivial to set this up.
If I'm starting a service on Labs that provides data to third-parties,
what would analytics recommend my easiest path is to getting request
logs into Hadoop?
We can’t do this directly into the production Analytics Cluster, since labs is firewalled off from production networks. However, a service like this would be intended to move to production eventually, yes? If so, then perhaps a beta Analytics Cluster would allow you to develop the methods needed to get data into Hadoop in Labs. Then the move into production would be simpler and already have Analytics Cluster support.
2. Event Logging. We're making this scale arbitrarily by moving it to Kafka. Once that's done, we should be able to instrument pretty much anything with Event Logging
Dan, I’d like to not promise anything here at the moment. I think this effort will significantly increase our throughput, but I’m not willing to claim arbitrary scale. Unless we figure out an easy way to farm out and parallelize eventlogging processors, scaling eventlogging to big data sizes, even with Kafka, will be cumbersome and manual.
Eventually I’d like to have a system that is bound by hardware and not architecture, but that is not well defined and still a long way off. We will see.
But Dan is right, eventlogging might be a good way to get labs data into the production Analytics Cluster, since any client can log via HTTP POSTs. We aren’t currently importing eventlogging data into the Analytics Cluster, but one of the points of the almost finished eventlogging-kafka is to get this data into Hadoop, so that should happen soon.
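To make the "any client can log via HTTP POSTs" point concrete, here is a minimal sketch of what a labs service might do. The endpoint URL, the schema name, and the payload layout are all assumptions made up for illustration; the actual eventlogging endpoint and event format are not spelled out in this thread and would need to be checked before using anything like this.

```python
import json
import urllib.request

# Hypothetical endpoint: the real eventlogging URL is not given in this thread.
EVENT_ENDPOINT = "http://eventlogging.example.org/event"


def build_event_request(schema, event, endpoint=EVENT_ENDPOINT):
    """Build (but do not send) an HTTP POST carrying one JSON-encoded event.

    The {"schema": ..., "event": ...} envelope is an assumed layout,
    not a confirmed eventlogging format.
    """
    body = json.dumps({"schema": schema, "event": event}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_event(schema, event, endpoint=EVENT_ENDPOINT, timeout=5):
    """POST the event and return the HTTP status code."""
    req = build_event_request(schema, event, endpoint)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

A service would call something like send_event("LabsServiceRequest", {"path": "/api/v1/foo", "status": 200}) once per request it wants recorded, where "LabsServiceRequest" is a made-up schema name. Building the request separately from sending it keeps the payload easy to inspect without touching the network.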
The commitment has to be made on both sides. The teams building the services have to instrument them,
Agree. If you want HTTP requests to your services and those HTTP requests go through varnish, this will be very easy. If you want anything beyond that, the service developers will have to implement it.
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics