I met with Nuria and some of her folks a couple of weeks ago to sync
up on things that the Analytics team is planning on for FY18/19 that
have specific overlap with the Cloud Services environment. The TL;DR
is that they are planning on provisioning servers to expose some of
their Data Lake
(<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake>) data to
the public via Cloud Services. There will be some help that they need
in rolling this out, but the expectation is that most of the work will
be handled by their team.
Here are my raw notes from that meeting:
=== Public data lake ===
* wiki replicas not good for research due to lack of normalization
* data lake provides a normalized data set
** current data lake system is a bunch of hive tables
** looking at presto + sql layer or similar tech for cloud exposure
** dataset is all public
* first publish the data on new bare metal hardware
* next work on adapting a tool (quarry?) to be able to get to that data more easily
* had some data quality issues that kept this from happening sooner
* Start in Q2; hope that the hard work has already been done
* Aim for a Q3 launch
* Bryan brought up possibility of wanting named user accounts for
tracking who is using the service
=== Asks from others to analytics ===
* folks from Audiences asking for data sooner, but that data needs to
be built from public replicas
** replicas get so busy that building the data set takes longer
** More wiki replica hardware might help, but without reserved capacity
all other usage would grow to fill the available compute power
** no plan to fix this yet, but aware of the problem
=== Future plans ===
* 3 node Kafka cluster for eventstream data
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA
irc: bd808 v:415.839.6885 x6855
2018-07-04 20:00:02,385 INFO force is enabled
2018-07-04 20:00:02,446 INFO removing misc-project-backup
2018-07-04 20:00:02,554 INFO removing misc-project-backup
2018-07-04 20:00:03,044 INFO creating misc-project-backup at 2T
2018-07-04 20:00:04,148 INFO force is enabled
2018-07-04 20:00:04,203 INFO removing misc-snap
2018-07-04 20:00:04,282 INFO removing misc-snap
2018-07-04 20:00:04,755 INFO creating misc-snap at 1T
2018-07-03 20:00:03,213 INFO force is enabled
2018-07-03 20:00:03,258 INFO removing tools-project-backup
2018-07-03 20:00:03,359 INFO removing tools-project-backup
2018-07-03 20:00:03,956 INFO creating tools-project-backup at 2T
2018-07-03 20:00:04,921 INFO force is enabled
2018-07-03 20:00:04,984 INFO removing tools-snap
2018-07-03 20:00:05,046 INFO removing tools-snap
2018-07-03 20:00:06,892 INFO creating tools-snap at 1T
I attended the SRE meeting yesterday as part of my clinic-duty/on-call week.
* Hiring: 2 people on the final stage of the hiring pipeline. I can give
more details.
* Running a diversity and inclusion survey (I'm not sure by whom)
* A reminder of expense reports, mind the new FY!
* Phab vandalism: discussion of measures, releng is taking care of it for now.
* Summary of the dumps incident by Ariel.
* Q1 goals were a big chunk of the meeting. Nothing seems to have a
direct WMCS dependency or to be actionable on our side.
Some goals were discussed; in many cases they were already mentioned in
Prague:
* PHP7 & mediawiki
* K8s (aka. deployment pipeline)
* DB backups improvements
* Netbox vs racktables
* Central certificate management for LE
* Increasing networking capacity
* Datacenter switchover. Some concerns: it requires a lot of time,
coordination, commitment...
Further discussion is probably required to choose concrete goals, as
Mark pointed out that doing all of them is probably too much work.