1. [done] Build pipeline to process and analyze pageview data
2. [done] Load pageview data into an API
3. [ ] Sanitize pageview data with more dimensions for public consumption
4. [beta] Build pipeline to process and analyze editing data
5. [beta] Load editing data into an API
6. [ ] Sanitize editing data for public consumption
7. [ ] Design UI to organize dashboards built around new data
8. [ ] Bonus: replace dumps generation using the new data pipelines
4 & 5. Since our last update, we've finished the pipeline that imports data from MediaWiki databases, cleans it up as well as possible, reshapes it in an analytics-friendly way, and makes it easily queryable. I'm marking these goals as "beta" because we're still tweaking the algorithm for performance and productionizing the jobs. That work will be completed early next quarter, but in the meantime we have data for people to play with internally. Sadly we haven't sanitized it yet, so we can't publish it. For those with internal access:
* In Hive, you can access this data in the wmf database. The tables are (see the example query after this list):
  - wmf.mediawiki_history: the denormalized full history (schema)
  - wmf.mediawiki_page_history: the sequence of states of each wiki page (schema)
  - wmf.mediawiki_user_history: the sequence of states of each user account (schema)
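To give a feel for how this data can be queried, here is a minimal PySpark sketch that counts revision-create events per wiki in wmf.mediawiki_history. The snapshot value and the column names used (snapshot, wiki_db, event_entity, event_type) are assumptions for illustration only; the linked schemas above are authoritative.

    # Minimal sketch, assuming PySpark with Hive support on the analytics cluster.
    # Column names and the snapshot partition value are assumptions; check the
    # linked schema for the authoritative field list.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("mediawiki-history-example")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Count revision-create events per wiki for one (hypothetical) snapshot partition.
    edits_per_wiki = spark.sql("""
        SELECT wiki_db, COUNT(*) AS revision_creates
        FROM wmf.mediawiki_history
        WHERE snapshot = '2017-01'
          AND event_entity = 'revision'
          AND event_type = 'create'
        GROUP BY wiki_db
        ORDER BY revision_creates DESC
    """)

    edits_per_wiki.show(20)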
6. Sanitizing has not moved forward, as it needs DBA time and the DBA team has been overloaded. We will attempt to restart this effort in Q3.
7. We have begun the design process; we'll share more about it as we go.
Our goals and planning for next quarter have us finishing 4, 5, 7, and 8: essentially putting a UI on top of the data pipeline we have in place and updating it weekly. We also hope to make good progress on 6, but that depends on collaboration with the DBA team and has turned out to be harder than we originally imagined.