Hi everybody, here's a quick summary of notes / take-aways from the
Analytics (Kraken) security review meeting.
* analytics1001 has been wiped and reimaged (restoring /home from backup)
* All proxies and externally-facing services have been disabled.
* Work is under way to bring everything that was puppetized under the
analytics1001 puppetmaster into operations-puppet after proper review.
Andrew is working closely with a number of people in ops to make this
happen.
* All future deployments to the cluster will be puppetized and go through
normal code review. Other than performance testing, these puppet confs will
be tested in labs.
* The rest of the cluster will be wiped and reimaged out of puppet; data in
HDFS will be preserved. This can be a rolling process allowing work to
proceed while it's under way.
* Schedule an Architectural Review meeting sometime during the SF Ops
hackathon, including a look at additional services and auth methods that
provide access to internal dashboards like Hue and such.
* Ensure all current "application" code (stuff written by WMF) gets
reviewed:
** Cron doing HDFS import from Kafka
** Pig UDFs and other data tools used in processing
** Future: Storm ETL layer
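For a sense of what the cron item covers, the Kafka-to-HDFS import is the kind of thing that would look roughly like the entry below. The script name, path, user, and schedule here are hypothetical placeholders, not the actual Kraken configuration:

```
# Hypothetical /etc/cron.d entry; script path, user, and schedule are
# illustrative only -- the real job is what's under review.
*/15 * * * * kraken /usr/local/bin/kafka-hdfs-import.sh >> /var/log/kraken/import.log 2>&1
```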
We all agreed the overall goal is to get to an acceptable security state.
During that process, the Analytics team still needs to continue to meet
stakeholder needs and deliver on promises. We decided to keep a "minimum
viable cluster" running while reimaging boxes and civilizing cluster
configuration:
* Wall off some portion of the boxes to continue receiving data and running
jobs; all other boxes can be wiped (preserving HDFS partitions). Boxes
would be incrementally removed from the "unsanitary" cluster, reimaged, and
then added to the "sanitary" cluster. Stupid bathroom-related jokes to be
avoided.
* Team Analytics to enumerate data processing jobs that will be running in
the intermediate period; their configurations and tooling will be reviewed.
* Analytics and Ops engineers continue to have shell access. Jobs can be
submitted and managed using the CLI tools; internal dashboards can be
accessed via SSH tunnelling. Analysts working on the cluster will be
approved for shell access on a case-by-case basis (afaik, just Evan Rosen
(full-time analyst for Grantmaking & Programs), and Stefan Petrea
(contractor for Analytics)). If more analysts desire access in the interim,
we can work it out on a case-by-case basis.
* No public, external access to any box in either zone (including proxied
dashboards like Hue, or even static files) that hasn't gone through review.
* Analytics and Ops will work together to find a simple, acceptable
mechanism for data export.
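As a concrete sketch of the SSH-tunnelling access mentioned above: an entry along these lines in `~/.ssh/config` forwards an internal dashboard to your laptop. The host names here are illustrative placeholders, and the port assumes Hue's default web port (8888):

```
# Hypothetical ~/.ssh/config entry; host names are placeholders.
Host kraken-hue
    HostName analytics1001.eqiad.wmnet
    LocalForward 8888 localhost:8888
```

Then run `ssh -N kraken-hue` and browse to http://localhost:8888/ locally.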
== Next Steps ==
* Analytics puppet manifests fully reviewed and merged into the master
operations-puppet repository
** Andrew to come pow-wow before the SF Ops hackathon and buddy it up with
ops to plow through some of this.
* Schedule Architecture Review
* Rolling reimaging of all analytics boxes (including hadoop data nodes but
preserving data), implementing the "minimum viable cluster" plan.
Questions very welcome! It's entirely possible I've missed things or
mistranslated them.
Cheers,
Team Analytics
--
David Schoonover
dsc(a)wikimedia.org