RFC: Analytics Reboot - Analytics

13 Feb 2013

Howdy,

After having spent some time reviewing the analytics github repo and
playing observer to the quarterly review last December, and today's
security/architecture mixup, I have a few opinions and suggestions that I'd
like to share.  They may upset some or step on toes.  Sorry about that.

Main suggestion - all logging, etl, storage, and compute infrastructure
should be owned, implemented, and maintained by the operations team.  There
should be a clear set of deliverables for ops: the entirety of the current
udp stream ingested, processed via an extensible etl layer with a minimum
of IP anonymization in place, and stored in hdfs in a standardized format
with logical access controls.  Technology and implementation choices should
ultimately rest with ops so long as all deliverables are met, though
external advice and assistance (including from industry experts outside of
wmf) will be welcome and solicited.

The analytics team owns everything above this.  Is pig the best tool to
analyze log data in hdfs?  Does hive make sense for some things?  Want to
add and analyze wiki revisions via map reduce jobs?  Visualize everything
imaginable?  Add more sophisticated transforms to the etl pipeline?  Go,
go, analytics!

I see the work accomplished to date under the heading of kraken as falling
into three categories:

1) Data querying.  This includes pig integration, repeatable queries run
via pig, and ad hoc map reduce jobs meant to analyze data written by folks
like Diederik.  While modifications may be needed if there are changes to
how data is stored in hdfs (such as file name conventions or format) or to
access controls, this category of work isn't tied to infrastructure details
and should be reusable on any generic hadoop implementation containing wmf
log data.

2) Devops work.  This includes everything Andrew Otto has done to puppetize
various pieces of the existing infrastructure.  I'd consider all of this
experimental.  Some might be reusable, some may need refactoring, some
should be chalked up as a learning exercise and abandoned.  Even if the
majority was to fall under that last category, this has undoubtedly been a
valuable learning experience.  Were Andrew to join the ops team and
collaborate with others on a from scratch implementation (let's say I'd
prefer us using the beta branch of actual apache hadoop instead of
cloudera), I'm sure the experience he's gained to date will be of use to
all.

3) Bound for mordor.  Never happened, never to be spoken of again.  This
includes things like the map reduce job executed via cron to transfer data
from kafka to hdfs, and... oh wait, never happened, never to be spoken of
again.

Unless I'm missing anything major, I don't see any reasons not to pursue
this new approach, nor does it appear that any significant amount of work
would be lost.  Instead, the most useful bits (category 1) should still be
useful.  And since that seems to be where analytics has been most
successful, perhaps it makes sense to let them focus fully on this sort of
thing instead of infrastructure.

-Asher