We're starting to wrap up Q1, so it's time for another wikistats update.  First, a quick reminder:

-----
If you currently use the existing reports, PLEASE give feedback in the section(s) at 
https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_report
Bonus points for noting which reports you use, how you use them, and which elements you most appreciate or would like added.
-----

Ok, so this is our list of high-level goals. As we mentioned before, we're focusing on taking a vertical slice through goals 4, 5, and 6 so we can deliver functionality and iterate.

1. [done] Build pipeline to process and analyze pageview data
2. [done] Load pageview data into an API
3. [        ] Sanitize pageview data with more dimensions for public consumption
4. [        ] Build pipeline to process and analyze editing data
5. [        ] Load editing data into an API
6. [        ] Sanitize editing data for public consumption
7. [        ] Design UI to organize dashboards built around new data
8. [        ] Build enough dashboards to replace the main functionality of stats.wikimedia.org
9. [        ] Officially replace stats.wikimedia.org with (maybe) analytics.wikimedia.org
*. [        ] Bonus: replace dumps generation using the new data pipelines


So here's the progress since last time, by high-level goal:

4. We can rebuild almost all page and user histories from the logging, revision, page, archive, and user MediaWiki tables.  The Scala/Spark algorithm scales well and can process English Wikipedia in less than an hour.  Once history is rebuilt, we want to join it into a denormalized schema.  We have an algorithm that handles simplewiki rather quickly, but we're still working on scaling it to enwiki.  For that reason, our vertical slice this quarter may include only simplewiki.  In addition to denormalizing the data to make it very simple for analysts and researchers to work with, we're also computing columns like "this edit was reverted at X timestamp" or "this page was deleted at X timestamp".  These will all be available in one flat schema (a rough sketch of the revert computation is included after this list).

5. We loaded the simplewiki data into Druid and put Pivot on top of it.  It's fantastically fun; I had to close that tab or I would've lost a day browsing around.  For a small database like simplewiki, Druid should have no problem maintaining an updated version of the computed columns mentioned above.  (I say "updated" because "this edit was reverted" is a fact that can change from false to true at some point in the future.)  We're still not 100% sure whether Druid can do that with the much larger enwiki data, but we're testing it.  We're also testing ClickHouse, another high-performance columnar OLAP store, just in case.  In short, we can already update once a week, and we're working out how feasible it is to update more often than that (the second sketch after this list shows one way the weekly reload could work).

6. We ran into a problem when thinking about sanitizing the data.  Our initial idea was to filter out the same columns that are filtered out when data is replicated to labsdb.  But we found that rows are also filtered, and the process that does that filtering is in need of a lot of love and care.  So we may take a detour to see if we can help out our fellow DBAs and Labs ops in the process, maybe unifying the edit data sanitization (the third sketch after this list shows the general column/row filtering idea).
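
For goal 4, here's a minimal Spark/Scala sketch of the idea behind the "reverted at X timestamp" column: treat an edit as reverted when a later revision of the same page restores content (same sha1) that the page had before the edit.  This is only an illustration, not our production job; the table and column names (wmf_raw.revision, rev_page, rev_id, rev_sha1, rev_timestamp) are placeholders for however the MediaWiki tables land in Hadoop, and the real job also has to deal with the archive table, page moves, and so on.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  object RevertSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("revert-sketch").getOrCreate()

      // Revisions with just the fields the revert computation needs.
      val rev = spark.table("wmf_raw.revision")
        .select("rev_page", "rev_id", "rev_sha1", "rev_timestamp")

      // (page_id, start_ts, revert_ts) pairs: a later revision restored the
      // exact content (same sha1) that the page had at start_ts.
      val revertPairs = rev.as("a").join(rev.as("b"),
          col("a.rev_page") === col("b.rev_page") &&
          col("a.rev_sha1") === col("b.rev_sha1") &&
          col("b.rev_timestamp") > col("a.rev_timestamp"))
        .select(col("a.rev_page").as("page_id"),
                col("a.rev_timestamp").as("start_ts"),
                col("b.rev_timestamp").as("revert_ts"))

      // An edit is reverted if it falls strictly between start_ts and
      // revert_ts on the same page; the earliest such revert_ts wins, and
      // reverted_at stays null for edits that were never reverted.
      val withRevertedAt = rev.as("r").join(revertPairs.as("p"),
          col("r.rev_page") === col("p.page_id") &&
          col("r.rev_timestamp") > col("p.start_ts") &&
          col("r.rev_timestamp") < col("p.revert_ts"),
          "left_outer")
        .groupBy(col("r.rev_page"), col("r.rev_id"), col("r.rev_timestamp"))
        .agg(min(col("p.revert_ts")).as("reverted_at"))

      withRevertedAt.write.mode("overwrite")
        .parquet("hdfs:///tmp/revert_sketch")   // illustrative output path
    }
  }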
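
For goal 5, here's a sketch of what the weekly update could look like on the Spark side, assuming the denormalized history sits in a Hive table and a Druid batch indexing task reads day-partitioned JSON off HDFS.  The table name, path, and columns are made up for illustration; the point is that batch re-ingestion replaces a datasource's segments for the covered intervals wholesale, so mutable facts like reverted_at simply come along with each fresh snapshot.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  object DruidExportSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("druid-export-sketch").getOrCreate()

      // Dump the current snapshot of the denormalized history as
      // day-partitioned JSON for Druid batch ingestion to pick up.
      spark.table("wmf.edit_history_denormalized")   // assumed table name
        .filter(col("wiki_db") === "simplewiki")     // assumed wiki column
        .withColumn("day", to_date(col("event_timestamp")))
        .write
        .mode("overwrite")                           // full reload each week
        .partitionBy("day")
        .json("hdfs:///wmf/data/druid_staging/simplewiki_edit_history")

      // A Druid batch indexing task pointed at this path then rebuilds the
      // datasource's segments, replacing the old ones, which is how
      // "this edit was reverted" can flip from false to true over time.
    }
  }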
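
And for goal 6, a sketch of the column-and-row filtering idea, in the same spirit as the labsdb views: drop private columns outright and filter out rows whose content has been suppressed.  The flags and column names here are invented for the example; pinning down the real rules (and ideally sharing them with the labsdb replication process) is exactly the work described above.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  object SanitizeSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("sanitize-sketch").getOrCreate()

      val history = spark.table("wmf.edit_history_denormalized")  // assumed table

      val publicHistory = history
        // Row filtering: keep only rows that are safe to publish, e.g. drop
        // revisions and users that have been suppressed / oversighted.
        .filter(col("revision_is_suppressed") === false &&        // assumed flag
                col("user_is_suppressed") === false)              // assumed flag
        // Column filtering: never expose private fields.
        .drop("user_email", "user_registration_ip")               // assumed columns

      publicHistory.write.mode("overwrite")
        .parquet("hdfs:///wmf/data/public/simplewiki_edit_history_sanitized")
    }
  }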

Steps remaining for having simplewiki data in Druid / Pivot by the end of Q1:
* vet data with Erik
* finish productionizing our Pivot install so internal/NDA folks can play with it