Asher,
Howdy,
After spending some time reviewing the analytics GitHub repo, observing the quarterly review last December, and watching today's security/architecture mixup, I have a few opinions and suggestions that I'd like to share. They may upset some or step on toes. Sorry about that.
Main suggestion - all logging, etl, storage, and compute infrastructure should be owned, implemented, and maintained by the operations team. There should be a clear set of deliverables for ops: the entirety of the current udp stream ingested, processed via an extensible etl layer with a minimum of IP anonymization in place, and stored in hdfs in a standardized format with logical access controls. Technology and implementation choices should ultimately rest with ops so long as all deliverables are met, though external advice and assistance (including from industry experts outside of wmf) will be welcome and solicited.
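To make the etl deliverable a bit more concrete, here's a minimal sketch in Python of what an extensible transform layer with IP anonymization as its first step might look like. Everything in it is an assumption for illustration rather than a description of anything that exists: the tab-separated field layout, the names parse_line, anonymize_ip, and run_pipeline, and the choice to truncate addresses to /24 for IPv4 and /48 for IPv6.

```python
import ipaddress
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]
Transform = Callable[[Record], Record]


def parse_line(line: str) -> Record:
    """Parse one log line. The ip/timestamp/url layout is hypothetical."""
    ip, timestamp, url = line.rstrip("\n").split("\t")
    return {"ip": ip, "timestamp": timestamp, "url": url}


def anonymize_ip(record: Record) -> Record:
    """Zero the host bits of the address (assumed policy: /24 or /48)."""
    addr = ipaddress.ip_address(record["ip"])
    prefix = 24 if addr.version == 4 else 48
    # strict=False lets us build the network from a host address.
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return {**record, "ip": str(network.network_address)}


def run_pipeline(lines: Iterable[str],
                 transforms: List[Transform]) -> Iterator[Record]:
    """Apply each transform in order to every parsed record."""
    for line in lines:
        record = parse_line(line)
        for transform in transforms:
            record = transform(record)
        yield record


if __name__ == "__main__":
    sample = ["192.0.2.77\t2013-01-01T00:00:00\t/wiki/Main_Page\n"]
    for rec in run_pipeline(sample, [anonymize_ip]):
        print(rec)
```

The point of the list of transforms is the ownership split: ops would guarantee that anonymize_ip (or its equivalent) always runs, and analytics could append more sophisticated transforms without touching the ingestion side.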
The analytics team owns everything above this. Is pig the best tool to analyze log data in hdfs? Does hive make sense for some things? Want to add and analyze wiki revisions via map reduce jobs? Visualize everything imaginable? Add more sophisticated transforms to the etl pipeline? Go, go, analytics!
I see the work accomplished to date under the heading of kraken as falling into three categories:
1) Data querying. This includes pig integration, repeatable queries run via pig, and ad hoc map reduce jobs written by folks like Diederik to analyze data. While modifications may be needed if there are changes to how data is stored in hdfs (such as file name conventions or format) or to access controls, this category of work isn't tied to infrastructure details and should be reusable on any generic hadoop implementation containing wmf log data.
2) Devops work. This includes everything Andrew Otto has done to puppetize various pieces of the existing infrastructure. I'd consider all of this experimental. Some might be reusable, some may need refactoring, some should be chalked up as a learning exercise and abandoned. Even if the majority were to fall under that last category, this has undoubtedly been a valuable learning experience. Were Andrew to join the ops team and collaborate with others on a from-scratch implementation (let's say I'd prefer us using the beta branch of actual apache hadoop instead of cloudera), I'm sure the experience he's gained to date would be of use to all.
3) Bound for mordor. Never happened, never to be spoken of again. This includes things like the map reduce job executed via cron to transfer data from kafka to hdfs, and... oh wait, never happened, never to be spoken of again.
Unless I'm missing anything major, I don't see any reason not to pursue this new approach, nor does it appear that any significant amount of work would be lost. The most useful bits (category 1) should carry over largely intact. And since that seems to be where analytics has been most successful, perhaps it makes sense to let them focus fully on this sort of thing instead of infrastructure.
-Asher
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics