I met with Nuria and some of her folks a couple of weeks ago to sync up on things that the Analytics team is planning for FY18/19 that have specific overlap with the Cloud Services environment. The TL;DR is that they are planning on provisioning servers to expose some of their Data Lake (https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake) data to the public via Cloud Services. They will need some help rolling this out, but the expectation is that most of the work will be handled by their team.
Here are my raw notes from that meeting:
=== Public data lake ===
* wiki replicas not good for research due to lack of normalization
* data lake provides a normalized data set
** current data lake system is a bunch of hive tables
** looking at presto + sql layer or similar tech for cloud exposure
** dataset is all public
* first publish the data on new bare metal hardware
* next work on adapting a tool (quarry?) to be able to get to that data easier
* had some data quality issues that kept this from happening sooner
* Start in Q2; hope that the hard work has already been done
* Aim for a Q3 launch
* Bryan brought up possibility of wanting named user accounts for tracking who is using the service
=== Asks from others to analytics ===
* folks from Audiences asking for data sooner, but that data needs to be built from public replicas
** replicas get so busy that building the data set takes longer
** More wiki replica hardware might help, but if not reserved capacity then all other usage would grow to fill available compute power
** no plan to fix this yet, but aware of the problem
=== Future plans ===
* 3 node Kafka cluster for eventstream data
Bryan
cloud-admin@lists.wikimedia.org