We're starting to wrap up Q1, so it's time for another wikistats update. First, a quick reminder:
-----
If you currently use the existing reports, PLEASE give feedback in the section(s) at https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_report
Bonus points for noting what you use, how you use it, and explaining what elements you most appreciate or might want added.
-----
OK, so this is our list of high-level goals. As we said before, we're focusing on taking a vertical slice through 4, 5, and 6 so we can deliver functionality and iterate.
1. [done] Build pipeline to process and analyze *pageview* data
2. [done] Load pageview data into an *API*
3. [ ] *Sanitize* pageview data with more dimensions for public consumption
4. [ ] Build pipeline to process and analyze *editing* data
5. [ ] Load editing data into an *API*
6. [ ] *Sanitize* editing data for public consumption
7. [ ] *Design* UI to organize dashboards built around new data
8. [ ] Build enough *dashboards* to replace the main functionality of stats.wikipedia.org
9. [ ] Officially replace stats.wikipedia.org with *(maybe) analytics.wikipedia.org*
*. [ ] Bonus: *replace dumps generation* based on the new data pipelines
So here's the progress since last time, by high-level goal:
4. We can rebuild almost all page and user histories from the logging, revision, page, archive, and user mediawiki tables. The Scala / Spark algorithm scales well and can process English Wikipedia in less than an hour. Once history is rebuilt, we want to join it into a denormalized schema. We have an algorithm that does this on simplewiki rather quickly, but we're *still working on scaling* it to work with English Wikipedia. For that reason, our vertical slice this quarter may include *only simplewiki*. In addition to denormalizing the data to make it very simple for analysts and researchers to work with, we're also computing columns like "this edit was reverted at X timestamp" or "this page was deleted at X timestamp". These will all be available in one flat schema; the sketch right below gives a rough idea of the shape.
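To make "one flat schema" concrete, here's a minimal Spark sketch of the denormalization step. This is *not* the production algorithm: the Parquet paths and column names are simplified stand-ins, and the real job computes many more columns (reverts, user state at edit time, etc.).

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, max}

    object DenormalizeHistorySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("denormalize-history-sketch")
          .getOrCreate()

        // Assume the mediawiki tables have already been imported to Parquet.
        val revision = spark.read.parquet("/wmf/data/raw/simplewiki/revision")
        val page     = spark.read.parquet("/wmf/data/raw/simplewiki/page")
        val logging  = spark.read.parquet("/wmf/data/raw/simplewiki/logging")

        // "this page was deleted at X timestamp": latest delete event per page.
        val pageDeletions = logging
          .filter(col("log_type") === "delete" && col("log_action") === "delete")
          .groupBy(col("log_page").as("page_id"))
          .agg(max("log_timestamp").as("page_deleted_timestamp"))

        // One flat row per edit: revision fields plus page state plus the
        // computed column, so analysts don't have to join anything themselves.
        val denormalized = revision
          .join(page, revision("rev_page") === page("page_id"), "left")
          .join(pageDeletions, Seq("page_id"), "left")

        denormalized.write.mode("overwrite")
          .parquet("/wmf/data/wmf/denormalized_history/simplewiki")
      }
    }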
5. We loaded the simplewiki data into Druid and put Pivot on top of it. It's fantastically fun; I had to close that tab or I would've lost a day browsing around. For a small db like simplewiki, Druid should have no problem maintaining an updated version of the computed columns mentioned above. (I say "updated" because "this edit was reverted" is a fact that can change from false to true at some point in the future; the sketch below shows why.) We're still not 100% sure whether Druid can do that with the much larger enwiki data, but we're testing it. We're also testing ClickHouse, another highly performant columnar OLAP store, just in case. In short, we can already update *once a week*, and we're working out how feasible it is to update more often than that.
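Why does the "reverted" fact need recomputation over the whole history instead of a simple append? Because an identity revert (a later revision restoring an earlier sha1) flips the flag on revisions that were already loaded. Here's a simplified spark-shell sketch of that computation, with illustrative column names; it's not the algorithm we actually run and it ignores a lot of edge cases.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, min}

    def markReverted(revision: DataFrame): DataFrame = {
      val revs = revision.select("rev_id", "page_id", "rev_sha1", "rev_timestamp")

      // Pairs (earlier, later) on the same page with identical content hashes.
      val identityReverts = revs.as("a").join(revs.as("b"),
          col("a.page_id") === col("b.page_id") &&
          col("a.rev_sha1") === col("b.rev_sha1") &&
          col("a.rev_timestamp") < col("b.rev_timestamp"))
        .select(col("a.page_id").as("p"),
                col("a.rev_timestamp").as("from_ts"),
                col("b.rev_timestamp").as("revert_ts"))

      // Everything strictly between such a pair was reverted at revert_ts.
      val revertedAt = revs.join(identityReverts,
          revs("page_id") === col("p") &&
          revs("rev_timestamp") > col("from_ts") &&
          revs("rev_timestamp") < col("revert_ts"))
        .groupBy(revs("rev_id"))
        .agg(min(col("revert_ts")).as("reverted_at_timestamp"))

      // A new revision can create a new (earlier, later) pair, so a row whose
      // reverted_at_timestamp was null last week can be non-null this week.
      revision.join(revertedAt, Seq("rev_id"), "left")
    }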
6. We ran into a *problem* when thinking about sanitizing the data. Our initial idea was to filter out the same columns that are filtered out when data is replicated to labsdb. But we found that rows are also filtered, and the process that does that filtering is in need of a lot of love and care. So we may side-track to see if we can help out our fellow DBAs and labs ops in the process, maybe unifying the edit data sanitization. The sketch after this paragraph shows very roughly the kind of filtering we mean.
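For a concrete picture of the columns-vs-rows point, here's very roughly the shape sanitization could take on the denormalized edit data. The rules and column names below are hypothetical placeholders, not the actual labsdb filters; pinning down those filters is exactly the work described above.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, lit}

    def sanitize(edits: DataFrame): DataFrame = {
      edits
        // Row filtering: drop revisions whose content, user, or comment
        // was deleted or suppressed.
        .filter(col("rev_deleted") === 0)
        // Column filtering: drop private columns outright...
        .drop("user_email", "user_password")
        // ...or blank them while keeping the row, if the schema must stay stable.
        .withColumn("user_ip", lit(null).cast("string"))
    }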
Steps remaining for having simplewiki data in Druid / Pivot by the end of Q1:
* vet data with Erik
* finish productionizing our Pivot install so internal/NDA folks can play with it