I just spent some time playing with Hive and JSON today, and I think I finally have a
grasp on all of the items and questions that are left to make this actually happen.
I'm writing them down here to summarize for you and for my own brain :)
- varnishkafka
-- compression support (snappy?)
-- puppet module
-- local puppetization (with our JSON logging format nailed down; rough config sketch below).
-- Packaged and installed on mobile hosts via puppet.
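
For reference, here's roughly the varnishkafka config I have in mind. This is only a sketch: the format string is a placeholder until we finalize the JSON fields, the topic name is made up, and the exact property names should be checked against varnishkafka's example config.

    # /etc/varnishkafka.conf (sketch; property names need verifying)
    output                     = kafka
    # emit each request as a JSON object
    format.type                = json
    # field list TBD once the JSON log format is nailed down
    format                     = ...
    # topic name is a placeholder
    kafka.topic                = webrequest-mobile
    kafka.metadata.broker.list = analytics1021:9092,analytics1022:9092
    # kafka.* properties pass through to librdkafka
    kafka.compression.codec    = snappy
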
- Kafka 0.8 Brokers
-- 0.8 package in apt.wikimedia.org (Alex K is going to do this for me soon).
-- Repave analytics1021 and analytics1022, install Kafka brokers via puppet (rough sketch below).
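
Once the package is in apt, the broker puppetization might look something like this. The class name, parameters, and zookeeper hosts here are all made up until we actually have a module; this is just the shape of it.

    # site.pp sketch -- hypothetical class/parameter names
    node /^analytics102[12]\./ {
        class { 'kafka::server':
            broker_id       => 21,                      # 22 on analytics1022
            zookeeper_hosts => ['analytics1023:2181'],  # placeholder
        }
    }
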
- Camus/ETL
-- Figure out how to deploy and run this: shaded jar? Puppetized cronjob? Oozie? (Possible invocation sketched below.)
-- If needed, implement geocoding and anonymization as part of the Camus ETL phase. This could also be done as an after-the-fact Pig or MR job scheduled by Oozie.
-- Do Hadoop compression settings automatically work when writing to HDFS from Camus? (See the camus.properties notes below.)
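
On deployment: the simplest option might be to build Camus as a shaded (fat) jar and run it from a puppetized cronjob. Something like the following, assuming Camus's CamusJob main class and its -P properties flag; the jar name and paths are whatever our build and puppetization end up producing.

    # hourly cronjob, e.g.:
    hadoop jar /usr/lib/camus/camus-etl-kafka-shaded.jar \
        com.linkedin.camus.etl.kafka.CamusJob \
        -P /etc/camus/camus.properties
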
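
On compression: it looks like Camus has its own output codec knobs in camus.properties rather than just inheriting the cluster-wide Hadoop settings, so we'd set something like the below. Property names are from Camus's example config (worth double checking); the paths and timestamp field are placeholders.

    # camus.properties (relevant bits; paths are placeholders)
    etl.destination.path          = /wmf/data/raw/webrequest
    etl.execution.base.path       = /wmf/camus
    # output compression for files written to HDFS
    mapred.output.compress        = true
    etl.output.codec              = snappy
    # decode varnishkafka's JSON messages
    camus.message.decoder.class   = com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
    camus.message.timestamp.field = dt
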
- Hive
-- How do we properly deploy and use hive-serdes-1.0-SNAPSHOT.jar? (Example usage below.)
-- Determine the proper webrequest Hive schema based on the final varnishkafka JSON log format. Put this in the Kraken repo somewhere?
-- Write an Oozie job to create Hive partitions after each Camus import (example ALTER TABLE below).
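
For the serde and schema items, usage would look roughly like this. The serde class name is from Cloudera's hive-serdes; the jar path, column list, and location are guesses until the varnishkafka format and deployment are settled.

    -- path is wherever we end up deploying the jar
    ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;

    CREATE EXTERNAL TABLE webrequest (
        hostname   STRING,
        sequence   BIGINT,
        dt         STRING,
        ip         STRING,
        uri        STRING,
        user_agent STRING
    )
    PARTITIONED BY (year INT, month INT, day INT, hour INT)
    ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
    LOCATION '/wmf/data/raw/webrequest';
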
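The Oozie partition job would then basically just run something like this after each hourly Camus import lands (partition values and path layout are illustrative):

    ALTER TABLE webrequest
    ADD IF NOT EXISTS PARTITION (year = 2013, month = 6, day = 5, hour = 14)
    LOCATION '/wmf/data/raw/webrequest/2013/06/05/14';
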
:)