Hi guys!

The Analytics team had the Kraken Arch Review with Mark and Faidon a couple of weeks ago.  I wanted to summarize a few things here so that everyone is aware of the status of the analytics nodes.

We defined 3 phases that the cluster has to go through before it is considered production ready.

1. Minimally Viable Cluster
This is what we have now, described at http://www.mediawiki.org/wiki/Analytics/Kraken/Overview.  analytics1001 has been reinstalled, but the other machines are still running unpuppetized Kraken stuff.  The Analytics team has deliverables due this month, and reinstalling and repuppetizing (with review) all of these nodes before then would slow us down too much.  Access to analytics1010 (the Hadoop NameNode) has been restricted with iptables, and Mark plans to set up network ACLs soon to restrict access from the analytics nodes to the rest of the cluster (see https://rt.wikimedia.org/Ticket/Display.html?id=4433).


2. Initial Base Cluster
This is basically what we have now, but fully reviewed and puppetized.  It is a transitional phase: it will not include Storm for ETL, and probably won't include using Kafka from the frontends.  All analytics machines will be reinstalled before we consider this phase complete.  We hope to get here in the next couple of months.


3. Production Cluster
This is the ideal setup, including Storm, frontend Kafka producers, Avro-serialized data, fully automated Oozie jobs, etc.



In the meantime, there may be weird non-puppetized things on the remaining analytics nodes (I'm referring to Leslie's email about the extra apt sources).  If you notice anything like this, please don't hesitate to ask me.  It's probably something that isn't being used anyway.


Thanks all!
-Ao