I just spent some time playing with Hive and JSON today, and I think I finally have a grasp on all of the items and questions that are left to make this actually happen. I'm writing them down here to summarize for you and for my own brain :)
- varnishkafka
-- Compression support (snappy?).
-- Puppet module.
-- Local puppetization (with our JSON logging format nailed down).
-- Packaged and installed on mobile hosts via puppet.
- Kafka 0.8 Brokers
-- 0.8 package in apt.wikimedia.org (Alex K is going to do this for me soon).
-- Repave analytics1021 and analytics1022, install Kafka brokers via puppet.
- Camus/ETL
-- Figure out how to deploy and run this: shaded jar? Puppetized cronjob? Oozie? (There's a rough camus.properties sketch at the bottom of this mail.)
-- If needed, implement geocoding and anonymization as part of the Camus ETL phase. This could also be done as an after-the-fact Pig or MapReduce job scheduled by Oozie.
-- Do Hadoop compression settings automatically apply when Camus writes to HDFS?
- Hive
-- How do we properly deploy and use hive-serdes-1.0-SNAPSHOT.jar?
-- Determine the proper webrequest Hive schema based on the final varnishkafka JSON log format. Put this in the Kraken repo somewhere? (Strawman DDL is sketched below.)
-- Write an Oozie job that creates Hive partitions after each Camus import. (Also sketched below.)
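
To make the Camus questions concrete, here's a rough, unverified sketch of what our camus.properties might look like. The property and class names are lifted from the stock Camus example config as best I remember them; the topic name, HDFS paths, timestamp field, and the snappy codec are all placeholders/assumptions, not decisions.

    # Rough camus.properties sketch; property and class names are from the stock
    # Camus example config, everything else (topic, paths, codec) is a placeholder.
    #
    # Camus builds as a single shaded jar, so "deploy" could be as simple as a
    # puppetized cronjob (or an Oozie action) running:
    #   hadoop jar camus-example-<version>-shaded.jar \
    #       com.linkedin.camus.etl.kafka.CamusJob -P /etc/camus.properties

    camus.job.name=webrequest-import

    # Brokers on the repaved hosts; 9092 is the Kafka default port.
    kafka.brokers=analytics1021:9092,analytics1022:9092
    # Placeholder topic name.
    kafka.whitelist.topics=webrequest_mobile

    # HDFS locations for imported data and Camus' execution/offset state (placeholders).
    etl.destination.path=/wmf/raw/webrequest
    etl.execution.base.path=/wmf/camus/exec
    etl.execution.history.path=/wmf/camus/exec/history

    # Decode plain JSON string messages instead of the default Avro decoder.
    # The timestamp field name and format are guesses pending the final log format.
    camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
    camus.message.timestamp.field=dt
    camus.message.timestamp.format=ISO-8601

    # Re: the compression question -- it looks like output compression is set per
    # job with these properties rather than inherited from cluster defaults (needs
    # verifying; the stock example uses deflate, snappy here is an assumption).
    mapred.output.compress=true
    etl.output.codec=snappy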
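
Likewise for Hive, a strawman of the webrequest table, assuming hive-serdes-1.0-SNAPSHOT.jar is the JSON SerDe from Cloudera's cdh-twitter-example (class com.cloudera.hive.serde.JSONSerDe). Every column name below is a guess until the varnishkafka JSON format is nailed down, and the location is a placeholder.

    -- Strawman webrequest table; columns are guesses until the JSON format is final.
    -- ADD JAR only lasts for the session; for production we'd probably want the
    -- jar on hive.aux.jars.path instead.
    ADD JAR /path/to/hive-serdes-1.0-SNAPSHOT.jar;

    CREATE EXTERNAL TABLE IF NOT EXISTS webrequest (
      hostname        STRING,
      sequence        BIGINT,
      dt              STRING,
      ip              STRING,
      http_status     STRING,
      response_size   BIGINT,
      uri_host        STRING,
      uri_path        STRING,
      uri_query       STRING,
      referer         STRING,
      x_forwarded_for STRING,
      user_agent      STRING
    )
    PARTITIONED BY (year INT, month INT, day INT, hour INT)
    ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
    -- Placeholder: wherever Camus' etl.destination.path ends up.
    LOCATION '/wmf/raw/webrequest';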
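
And the Oozie partition job would essentially boil down to running a statement like this (e.g. via a Hive action in a coordinator) each time Camus finishes an hourly import; the directory layout shown is a placeholder for whatever Camus actually writes.

    -- Register one hourly Camus output directory as a partition (placeholder layout).
    ALTER TABLE webrequest ADD IF NOT EXISTS
      PARTITION (year = 2013, month = 5, day = 1, hour = 14)
      LOCATION '/wmf/raw/webrequest/2013/05/01/14';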
:)