Cross-posting. See the follow-up discussion in the Analytics list web archive.

---------- Forwarded message ----------
From: Dan Andreescu <dandreescu@wikimedia.org>
Date: Friday, June 5, 2015
Subject: [Analytics] Pageview API Status update
To: Analytics List <analytics@lists.wikimedia.org>


I just posted a comment on the famous task: https://phabricator.wikimedia.org/T44259#1341010 :)

Here it is for those who would rather discuss on this list:


We have finished analyzing the intermediate hourly aggregate with all the columns that we think are interesting.  The data is too large to query and anonymize in real time.  We'd rather get an API out faster than deal with that problem, so we decided to produce smaller "cubes" [1] of data for specific purposes.  We have two cubes in mind and I'll explain those here.  For each cube, we're aiming to have:

* Direct access to a postgresql database in labs with the data
* API access through RESTBase
* Mondrian / Saiku access in labs for dimensional analysis
* Data will be pre-aggregated so that any single data point has k-anonymity (we have not determined a good k yet)
* Higher-level aggregations will be pre-computed from the full data set, so their totals include all data
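
To make the last two points concrete, here is a minimal sketch of one possible reading of them: count views per combination of dimensions, publish only the cells that reach a threshold k, and compute the coarser totals from the unfiltered rows. The helper names (`aggregate`, `suppress_small_cells`), the sample rows, and the value k = 2 are all made up for illustration; treating "k-anonymity" as a simple minimum-count threshold is an assumption, not something we have decided.

```python
from collections import Counter

def aggregate(rows, dims):
    """Count rows (one row = one pageview) per combination of `dims`."""
    counts = Counter()
    for row in rows:
        counts[tuple(row[d] for d in dims)] += 1
    return counts

def suppress_small_cells(counts, k):
    """Keep only cells with at least k views (assumed reading of k-anonymity)."""
    return {key: n for key, n in counts.items() if n >= k}

# Made-up sample data: three views from one country, one from another.
rows = [
    {"project": "en.wikipedia", "page_title": "Influenza", "country_code": "US"},
    {"project": "en.wikipedia", "page_title": "Influenza", "country_code": "US"},
    {"project": "en.wikipedia", "page_title": "Influenza", "country_code": "US"},
    {"project": "en.wikipedia", "page_title": "Influenza", "country_code": "RO"},
]

K = 2  # placeholder; no k has been chosen yet

# Fine-grained cells are filtered before publishing.
fine = suppress_small_cells(
    aggregate(rows, ["project", "page_title", "country_code"]), K
)

# The higher-level total is computed from all rows, not from `fine`,
# so the suppressed view still counts towards it.
coarse = aggregate(rows, ["project", "page_title"])

print(fine)
print(dict(coarse))
```

Note that publishing exact coarse totals next to suppressed fine cells can let someone recover a suppressed value by subtraction, which is one reason picking k and the aggregation levels needs some care.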

The cubes are:

**stats.grok.se Cube: basic pageview data**

Hourly resolution.  Will serve the same purpose as stats.grok.se has served for so many years.  The dimensions available will be:

* project - 'Project name from request's hostname'
* dialect - 'Dialect from request's path (not set if present in project name)'
* page_title - 'Page Title from request's path and query'
* access_method - 'Method used to access the pages, can be desktop, mobile web, or mobile app'
* is_zero - 'accessed through a zero provider'
* agent_type - 'Agent accessing the pages, can be spider or user'
* referer_class - 'Can be internal, external or unknown'
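
As a rough illustration of how these dimensions could be used, the sketch below loads a few made-up rows into an in-memory SQLite table and asks for hourly human pageviews of one page. SQLite only stands in for the PostgreSQL database mentioned above; the table name `pageview_hourly`, the `hour` and `view_count` columns, and the function `hourly_user_views` are assumptions, not the final schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pageview_hourly (
        hour          TEXT,     -- hypothetical time column
        project       TEXT,
        dialect       TEXT,
        page_title    TEXT,
        access_method TEXT,
        is_zero       INTEGER,
        agent_type    TEXT,
        referer_class TEXT,
        view_count    INTEGER   -- hypothetical measure column
    )
    """
)

# Made-up rows, for illustration only.
sample = [
    ("2015-06-05T00", "en.wikipedia", None, "Influenza", "desktop", 0, "user", "internal", 120),
    ("2015-06-05T00", "en.wikipedia", None, "Influenza", "mobile web", 0, "user", "external", 80),
    ("2015-06-05T00", "en.wikipedia", None, "Influenza", "desktop", 0, "spider", "unknown", 40),
    ("2015-06-05T01", "en.wikipedia", None, "Influenza", "desktop", 0, "user", "internal", 100),
]
conn.executemany("INSERT INTO pageview_hourly VALUES (?,?,?,?,?,?,?,?,?)", sample)

def hourly_user_views(conn, project, page_title):
    """Sum views per hour for one page, excluding spiders."""
    cur = conn.execute(
        """
        SELECT hour, SUM(view_count)
        FROM pageview_hourly
        WHERE project = ? AND page_title = ? AND agent_type = 'user'
        GROUP BY hour
        ORDER BY hour
        """,
        (project, page_title),
    )
    return cur.fetchall()

result = hourly_user_views(conn, "en.wikipedia", "Influenza")
print(result)
```

This is roughly the per-article, per-hour lookup that stats.grok.se offers, with the extra dimensions available as filters.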


**Geo Cube: geo-coded pageview data**

Daily resolution.  Will allow researchers to track the flu, breaking news, etc.  Dimensions will be:

* project - 'Project name from request's hostname'
* page_title - 'Page Title from request's path and query'
* country_code - 'Country ISO code of the accessing agents (computed using MaxMind GeoIP database)'
* province - 'State / Province of the accessing agents (computed using MaxMind GeoIP database)'
* city - 'Metro area of the accessing agents (computed using MaxMind GeoIP database)'
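
City-level cells are where small counts are most likely, so here is a sketch of one way they might be handled: instead of dropping a cell that is below k, fold it into the next coarser geographic level (city into province, province into country) and drop it only if it is still too small at country level. This is an illustration of generalization as an option, not a description of what we will build. The function `generalize`, the `views` field, the sample rows, and k = 5 are all hypothetical, and the rows are assumed to cover a single day.

```python
from collections import Counter

GEO_LEVELS = ("country_code", "province", "city")  # coarse to fine

def generalize(rows, k):
    """Publish each cell at the finest geographic level where it has >= k views."""
    cells = Counter()
    for r in rows:
        key = (r["project"], r["page_title"]) + tuple(r[g] for g in GEO_LEVELS)
        cells[key] += r["views"]

    published = {}
    # One pass per geographic level: city first, then province, then country.
    for _ in GEO_LEVELS:
        rolled_up = Counter()
        for key, n in cells.items():
            if n >= k:
                published[key] = n
            else:
                # Drop the finest remaining geographic field and merge.
                rolled_up[key[:-1]] += n
        cells = rolled_up
    # Whatever is left is below k even at country level and is not published.
    return published

# Made-up sample data.
rows = [
    {"project": "en.wikipedia", "page_title": "Influenza", "country_code": "US",
     "province": "California", "city": "San Francisco", "views": 50},
    {"project": "en.wikipedia", "page_title": "Influenza", "country_code": "US",
     "province": "California", "city": "Fresno", "views": 3},
    {"project": "en.wikipedia", "page_title": "Influenza", "country_code": "US",
     "province": "California", "city": "Eureka", "views": 4},
    {"project": "en.wikipedia", "page_title": "Influenza", "country_code": "US",
     "province": "Nevada", "city": "Reno", "views": 2},
]

published = generalize(rows, k=5)
for key, n in published.items():
    print(key, n)
```

In this sketch a province-level cell holds only the views that could not be published at city level, so it is a remainder rather than a full province total.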


So, if anyone wants another cube, **now** is the time to speak up.  We'll probably add cubes later, but it may be a while.

[1] OLAP cubes: https://en.wikipedia.org/wiki/OLAP_cube