Hey all,
We're building a lot of tools out on Labs. From a RESTful API to a
Wikidata Query Service, we're making neat things, and Labs is proving
the perfect place to prototype them - in all but one respect.
A crucial part of making these tools not just useful but measurably
useful is having their logs available to parse. Combine that with the
constraints on getting things onto the main cluster, and what you have
is a situation where much of our beta or alpha software has no
integration with our existing data storage systems but absolutely
/needs/ it, both to verify that it's worth keeping and to provide data
about usage.
So I'm asking, I guess, two things. The first is: can we have a firm
commitment that we'll get this kind of stuff into Hadoop? Right now we
have a RESTful API out there that is not (to my knowledge) throwing
data into the request logs. We have a WDQS that isn't either.
Undoubtedly we have other tools I haven't encountered. It's paramount
that the first question we ask with new services or systems is "so
when does new traffic data start hitting the analytics cluster?"
Second: what are the best practices for this? What resources are
available? If I'm starting a service on Labs that provides data to
third parties, what would Analytics recommend as the easiest path to
getting request logs into Hadoop?
--
Oliver Keyes
Research Analyst
Wikimedia Foundation