[Wikistats 2.0] [Regular Update] Wrapping up Q1 - Analytics

17 Sep 2016

We're starting to wrap up Q1, so it's time for another wikistats update.
First, a quick reminder:

-----
If you currently use the existing reports, PLEASE give feedback in the
section(s) at
https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_report
Bonus points for noting what you use, how you use it, and explaining what
elements you most appreciate or might want added.
-----

Ok, so this is our list of high level goals, and as we were saying before,
we're focusing on taking a vertical slice through 4, 5, and 6 so we can
deliver functionality and iterate.

1. [done] Build pipeline to process and analyze *pageview* data
2. [done] Load pageview data into an *API*
3. [        ] *Sanitize* pageview data with more dimensions for public
consumption
4. [        ] Build pipeline to process and analyze *editing* data
5. [        ] Load editing data into an *API*
6. [        ] *Sanitize* editing data for public consumption
7. [        ] *Design* UI to organize dashboards built around new data
8. [        ] Build enough *dashboards* to replace the main functionality
of stats.wikipedia.org
9. [        ] Officially replace stats.wikipedia.org with *(maybe)
analytics.wikipedia.org* <http://analytics.wikipedia.org/>
***. [         ] Bonus: *replace dumps generation* based on the new data
pipelines

So here's the progress since last time by high level goal:

4. We can rebuild almost all page and user histories from the logging,
revision, page, archive, and user MediaWiki tables.  The Scala / Spark
algorithm scales well and can process English Wikipedia in less than an
hour.  Once history is rebuilt, we want to join it into a denormalized
schema.  We have an algorithm that works on simplewiki rather quickly, but
we're *still working on scaling* it to work with English Wikipedia.  For
that reason, our
vertical slice this quarter may include *only simplewiki*.  In addition to
denormalizing the data to make it very simple for analysts and researchers
to work with, we're also computing columns like "this edit was reverted at
X timestamp" or "this page was deleted at X timestamp".  These will all be
available in one flat schema.
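
To make the "one flat schema" idea concrete, here is a minimal sketch in
plain Python (not the Scala / Spark job itself).  All field names, the
sample rows, and the denormalize() helper are hypothetical, and real
MediaWiki tables are far messier than this.  It joins revisions to page and
user data and attaches a "page deleted at X timestamp" column taken from
logging-style delete events.

```python
# Hypothetical, heavily simplified stand-ins for rows from MediaWiki tables.
pages = {1: {"page_title": "Apple"}, 2: {"page_title": "Pear"}}
users = {10: {"user_name": "Alice"}, 11: {"user_name": "Bob"}}
revisions = [
    {"rev_id": 100, "page_id": 1, "user_id": 10, "timestamp": "2016-01-01T00:00:00Z"},
    {"rev_id": 101, "page_id": 2, "user_id": 11, "timestamp": "2016-01-02T00:00:00Z"},
    {"rev_id": 102, "page_id": 2, "user_id": 10, "timestamp": "2016-01-03T00:00:00Z"},
]
# Logging-style delete events (assumed shape, for illustration only).
delete_events = [{"page_id": 2, "timestamp": "2016-02-01T00:00:00Z"}]


def denormalize(revisions, pages, users, delete_events):
    """Return one flat row per revision with page, user and computed columns."""
    # Earliest delete timestamp per page. ISO 8601 strings in the same
    # format and timezone sort chronologically, so string comparison works.
    deleted_at = {}
    for event in delete_events:
        page_id, ts = event["page_id"], event["timestamp"]
        if page_id not in deleted_at or ts < deleted_at[page_id]:
            deleted_at[page_id] = ts

    flat_rows = []
    for rev in revisions:
        flat_rows.append({
            "rev_id": rev["rev_id"],
            "rev_timestamp": rev["timestamp"],
            "page_id": rev["page_id"],
            "page_title": pages[rev["page_id"]]["page_title"],
            "user_id": rev["user_id"],
            "user_name": users[rev["user_id"]]["user_name"],
            # Computed column: None means "never deleted, as far as we know".
            "page_deleted_timestamp": deleted_at.get(rev["page_id"]),
        })
    return flat_rows


flat = denormalize(revisions, pages, users, delete_events)
```

With rows like these, an analyst can filter or group on any column without
writing joins.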

5. We loaded the simplewiki data into Druid and put Pivot on top of it.
It's fantastically fun; I had to close that tab or I would've lost a day
browsing around.  For a small db like simplewiki, Druid should have no
problem maintaining an updated version of the computed columns mentioned
above.  (I say updated because "this edit was reverted" is a fact that can
change from false to true at some point in the future).  We're still not
100% sure whether Druid can do that with the much larger enwiki data, but
we're testing that.  And we're also testing ClickHouse, another highly
performant OLAP big data columnar store, just in case.  In short, we can
update *once a week* already, and we're working on seeing how feasible it
is to update more often than that.
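
A small sketch of why these computed columns need updating.  It assumes
reverts are detected as "identity reverts" (a later revision restores the
exact content hash of an earlier one).  That is an assumption made for
illustration, not necessarily how our pipeline does it, and
reverted_timestamps() and the sample history are hypothetical.

```python
def reverted_timestamps(revisions):
    """Map rev_id -> timestamp of the edit that reverted it, or None.

    `revisions` is a list of (rev_id, timestamp, sha1) tuples for a single
    page, in chronological order.
    """
    result = {rev_id: None for rev_id, _, _ in revisions}
    last_index_of = {}  # sha1 -> index of the latest revision with that content
    for i, (rev_id, ts, sha1) in enumerate(revisions):
        if sha1 in last_index_of:
            # Content went back to an earlier state, so every revision in
            # between counts as reverted by this one (first revert wins).
            for j in range(last_index_of[sha1] + 1, i):
                middle_id = revisions[j][0]
                if result[middle_id] is None:
                    result[middle_id] = ts
        last_index_of[sha1] = i
    return result


history = [
    (1, "2016-09-01T10:00:00Z", "aaa"),
    (2, "2016-09-02T10:00:00Z", "bbb"),
]
before = reverted_timestamps(history)

# A week later someone restores the content of revision 1.
history.append((3, "2016-09-09T10:00:00Z", "aaa"))
after = reverted_timestamps(history)
```

Revision 2 is not reverted in `before` but is reverted in `after`, so any
store holding this column has to rewrite already-loaded rows, not just
append new ones.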

6. We ran into a *problem* when thinking about sanitizing the data.  Our
initial idea was to filter out the same columns that are filtered out when
data is replicated to labsdb.  But we found rows are also filtered and the
process for doing that filtering is in need of a lot of love and care.  So
we may side-track to see if we can help out our fellow DBAs and labs ops in
the process, maybe unifying the edit data sanitization.
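
To illustrate the difference between column filtering and row filtering,
here is a minimal sketch.  The column names, the "suppressed" flag, and
both rules are made up for the example; they are not the actual labsdb
sanitization rules.

```python
# Illustrative rules only; the real replication filters are more involved.
PRIVATE_COLUMNS = {"user_email", "user_password"}


def is_public_row(row):
    """Row-level rule: drop rows flagged as suppressed (hypothetical flag)."""
    return not row.get("suppressed", False)


def sanitize(rows):
    """Apply the row filter first, then strip private columns from the rest."""
    return [
        {key: value for key, value in row.items() if key not in PRIVATE_COLUMNS}
        for row in rows
        if is_public_row(row)
    ]


raw_rows = [
    {"rev_id": 1, "user_name": "Alice", "user_email": "a@example.org"},
    {"rev_id": 2, "user_name": "Bob", "user_email": "b@example.org",
     "suppressed": True},
]
public_rows = sanitize(raw_rows)
```

Filtering columns alone would still have published the suppressed row,
which is why the row-level rules matter.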

Steps remaining for having simplewiki data in Druid / Pivot by the end of
Q1:
* vet data with Erik
* finish productionizing our Pivot install so internal/NDA folks can play
with it