[discovery] mediawiki -> kafka -> hadoop -> hive data pipeline up and running

2 Nov 2015


I've been too busy to move this forward in the last few weeks, but I finally
found some time to deploy what we had been working on. This pipeline is
now up, running, and queryable from hive. It's sampling 1:1000 right now, as I
didn't want a flood of errors if it went wrong, but based on the success so
far I will be dropping the sampling so it captures everything our old logs
did. For the time being we will continue logging CirrusSearchRequests and
CirrusSearchUserTesting to fluorine (and rsync to stat1002 for processing),
but that can be turned off once we move any existing data processing over
to hive.
Very exciting!
There are still a few minor things to figure out: my first edition of the
table in hive doesn't handle the external partitioning right, but I will fix
that soon enough.
