I've been too busy to move this forward over the last few weeks, but I finally found some time to deploy what we had been working on. This pipeline is now up, running, and queryable from Hive. It's sampling at 1:1000 right now, as I didn't want a flood of errors if it went wrong, but based on the success so far I will be removing the sampling so it captures everything our old logs did. For the time being we will continue logging CirrusSearchRequests and CirrusSearchUserTesting to fluorine (and rsyncing them to stat1002 for processing), but that can be turned off once we move any existing data processing over to Hive.

Very exciting!

There are still a few minor things to figure out: my first version of the table in Hive doesn't handle the external partitioning correctly, but I will fix that soon enough.