Hey all,
We're building a lot of tools out on Labs. From a RESTful API to a Wikidata Query Service, we're making neat things, and Labs is proving the perfect place to prototype them - in all but one respect.
A crucial part of these tools being not just useful but measurably useful is having their logs available to parse. When you combine that with the constraints on getting things onto the main cluster, what you have is a situation where much of our beta or alpha software has no integration with our existing data storage systems but absolutely /needs/ it - both to verify that it's worth keeping and to provide data about usage.
So I'm asking, I guess, two things. The first is: can we have a firm commitment that we'll get this kind of stuff into Hadoop? Right now we have a RESTful API that is not (to my knowledge) throwing data into the request logs anywhere. We have a WDQS that isn't either. Undoubtedly we have other tools I haven't encountered. It's paramount that the first question we ask with new services or systems is "so when does the new traffic data start hitting the analytics cluster?"
Second: what are the best practices for this? What resources are available? If I'm starting a service on Labs that provides data to third parties, what would Analytics recommend as my easiest path to getting request logs into Hadoop?
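To make that second question concrete, here's roughly the shape of things on my end today - a toy Flask sketch (every name, route and path below is hypothetical, not any actual tool of ours) that just appends one JSON line per request to a local file, on the assumption that something downstream could eventually ship those lines towards Kafka/Hadoop. Is this the kind of shape Analytics would want, or is there a better-supported route?

    # Hypothetical Labs prototype: log each request as a line of JSON
    # to a local file, hoping it can later be shipped into Hadoop.
    import json
    import time
    from flask import Flask, request

    app = Flask(__name__)
    LOG_PATH = "/data/project/example-tool/request.log"  # hypothetical path

    @app.after_request
    def log_request(response):
        # One JSON object per request, loosely mirroring request-log fields.
        record = {
            "dt": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "uri_path": request.path,
            "http_method": request.method,
            "http_status": response.status_code,
            "user_agent": request.headers.get("User-Agent", ""),
        }
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps(record) + "\n")
        return response

    @app.route("/api/v1/items/<item_id>")
    def get_item(item_id):
        # Stand-in endpoint so the sketch is self-contained.
        return json.dumps({"id": item_id})

If there's an agreed schema or an existing pipeline we should be plugging into instead of rolling our own files like this, that's exactly the guidance I'm after.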