I just spent some time playing with Hive and JSON today, and I think I finally have a
grasp on all of the items and questions that are left to make this actually happen.
I'm writing them down here to summarize for you and for my own brain :)
- varnishkafka
-- compression support (snappy?)
-- puppet module
-- local puppetization (with our JSON logging format nailed down; rough config sketch below).
-- Packaged and installed on mobile hosts via puppet.
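
For reference, here's roughly the varnishkafka config I have in mind. This is only a sketch: the format string is a placeholder until we finalize the JSON fields, the topic name is made up, and the exact property names should be checked against varnishkafka's example config.

    # /etc/varnishkafka.conf (sketch; property names need verifying)
    output                     = kafka
    # emit each request as a JSON object
    format.type                = json
    # field list TBD once the JSON log format is nailed down
    format                     = ...
    # topic name is a placeholder
    kafka.topic                = webrequest-mobile
    kafka.metadata.broker.list = analytics1021:9092,analytics1022:9092
    # kafka.* properties pass through to librdkafka
    kafka.compression.codec    = snappy
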
- Kafka 0.8 Brokers
-- 0.8 package in apt.wikimedia.org (Alex K is going to do this for me soon).
-- Repave analytics1021 and analytics1022, install Kafka brokers via puppet (rough sketch below).
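
Once the package is in apt, the broker puppetization might look something like this. The class name, parameters, and zookeeper hosts here are all made up until we actually have a module; this is just the shape of it.

    # site.pp sketch -- hypothetical class/parameter names
    node /^analytics102[12]\./ {
        class { 'kafka::server':
            broker_id       => 21,                      # 22 on analytics1022
            zookeeper_hosts => ['analytics1023:2181'],  # placeholder
        }
    }
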
- Camus/ETL
-- Figure out how to deploy and run this: shaded jar? Puppetized cronjob? Oozie? (Possible invocation sketched below.)
-- If needed, implement geocoding and anonymization as part of the Camus ETL phase. This could also be done as an after-the-fact Pig or MR job scheduled by Oozie.
-- Do Hadoop compression settings automatically work when writing to HDFS from Camus? (See the camus.properties notes below.)
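
On deployment: the simplest option might be to build Camus as a shaded (fat) jar and run it from a puppetized cronjob. Something like the following, assuming Camus's CamusJob main class and its -P properties flag; the jar name and paths are whatever our build and puppetization end up producing.

    # hourly cronjob, e.g.:
    hadoop jar /usr/lib/camus/camus-etl-kafka-shaded.jar \
        com.linkedin.camus.etl.kafka.CamusJob \
        -P /etc/camus/camus.properties
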
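
On compression: it looks like Camus has its own output codec knobs in camus.properties rather than just inheriting the cluster-wide Hadoop settings, so we'd set something like the below. Property names are from Camus's example config (worth double checking); the paths and timestamp field are placeholders.

    # camus.properties (relevant bits; paths are placeholders)
    etl.destination.path          = /wmf/data/raw/webrequest
    etl.execution.base.path       = /wmf/camus
    # output compression for files written to HDFS
    mapred.output.compress        = true
    etl.output.codec              = snappy
    # decode varnishkafka's JSON messages
    camus.message.decoder.class   = com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
    camus.message.timestamp.field = dt
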
- Hive
-- How do we properly deploy and use hive-serdes-1.0-SNAPSHOT.jar? (Example usage below.)
-- Determine the proper webrequest Hive schema based on the final varnishkafka JSON log format. Put this in the Kraken repo somewhere?
-- Write an Oozie job to create Hive partitions after each Camus import (example ALTER TABLE below).
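
For the serde and schema items, usage would look roughly like this. The serde class name is from Cloudera's hive-serdes; the jar path, column list, and location are guesses until the varnishkafka format and deployment are settled.

    -- path is wherever we end up deploying the jar
    ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;

    CREATE EXTERNAL TABLE webrequest (
        hostname   STRING,
        sequence   BIGINT,
        dt         STRING,
        ip         STRING,
        uri        STRING,
        user_agent STRING
    )
    PARTITIONED BY (year INT, month INT, day INT, hour INT)
    ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
    LOCATION '/wmf/data/raw/webrequest';
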
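The Oozie partition job would then basically just run something like this after each hourly Camus import lands (partition values and path layout are illustrative):

    ALTER TABLE webrequest
    ADD IF NOT EXISTS PARTITION (year = 2013, month = 6, day = 5, hour = 14)
    LOCATION '/wmf/data/raw/webrequest/2013/06/05/14';
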
:)