Analytics plans for FY18/19 related to Cloud Services - Cloud-admin

6 Jul 2018


      I met with Nuria and some of her folks a couple of weeks ago to sync
up on things that the Analytics team is planning on for FY18/19 that
have specific overlap with the Cloud Services environment. The TL;DR
is that they are planning on provisioning servers to expose some of
their Data Lake
(https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake) data to
the public via Cloud Services. They will need some help rolling this
out, but the expectation is that most of the work will be handled by
their team.
Here are my raw notes from that meeting:
=== Public data lake ===
* wiki replicas not good for research due to lack of normalization
* data lake provides a normalized data set
** current data lake system is a bunch of hive tables
** looking at presto + sql layer or similar tech for cloud exposure
** dataset is all public
* first publish the data on new bare metal hardware
* next work on adapting a tool (quarry?) to make that data easier to get to
* had some data quality issues that kept this from happening sooner
* Start in Q2; hope that the hard work has already been done
* Aim for a Q3 launch
* Bryan brought up possibility of wanting named user accounts for
tracking who is using the service
=== Asks from others to analytics ===
* folks from Audiences asking for data sooner, but that data needs to
be built from public replicas
** replicas get so busy that building the data set takes longer
** More wiki replica hardware might help, but without reserved capacity
all other usage would grow to fill the available compute power
** no plan to fix this yet, but aware of the problem
=== Future plans ===
* 3 node Kafka cluster for eventstream data
Bryan
-- 
Bryan Davis              Wikimedia Foundation    bd808@wikimedia.org
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement    Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855