cc-ing our friends in research and wikitech (sorry I forgot initially)

We're happy to announce a few improvements to Analytics data releases on dumps.wikimedia.org:

* We are releasing a new dataset, an estimate of Unique Devices accessing our projects [1]
* We are officially making available a better Pageviews dataset [2]
* We are deprecating two older pageview statistics datasets
* We moved Analytics data from /other to /analytics [3]

Details follow:


Unique Devices: Since 2009, the Wikimedia Foundation used comScore to report data about unique web visitors.  In January 2016, however, we decided to stop reporting comScore numbers [4] because limitations in the methodology led to misreported mobile usage. We are now ready to replace comScore numbers with the Unique Devices Dataset [5][1]. While the number of unique devices does not equal the number of unique visitors, it is a good proxy for that metric: a major increase in the number of unique devices is likely to come from an increase in distinct users. We understand that counting uniques raises significant privacy concerns, so we count unique devices in a privacy-conscious way that does not involve any cookie by which your browser history can be tracked [6].

We invite you to explore this new dataset and hope it’s helpful for the Wikimedia community in better understanding our projects. This data can help measure the reach of Wikimedia projects on the web.

Pageviews: This [2] is the best quality data available for counting the number of pageviews our projects receive at the article and project level.  We've upgraded from pagecounts-raw to pagecounts-all-sites, and now to pageviews, in order to filter out more spider traffic and measure something closer to what we think is a real user viewing content.  A short history might be useful:

    * pagecounts-raw: was originally maintained by Domas Mituzas and later taken over by the analytics team.  It was and still is the most used dataset, though it has some major problems.  It does not count access to the mobile site, it does not filter out spider or bot traffic, and it suffers from unknown loss due to logging infrastructure limitations.
    * pagecounts-all-sites: uses the same pageview definition as pagecounts-raw, and so also does not filter out spider or bot traffic.  But it does include access to mobile and zero sites, and is built on a more reliable logging infrastructure.
    * pagecounts-ez: is derived from the best data available at the time.  So until December 2015, it was based on pagecounts-raw and pagecounts-all-sites, and now it's based on pageviews.  This dataset is great because it compresses very large files without losing any information, still providing hourly page and project level statistics.

So the new dataset, pageviews, is what's behind our pageview API and is now available in static files for bulk download back to May 2015.  But having multiple ways to download pageview data is confusing for consumers, so we're keeping only pageviews and pagecounts-ez and deprecating the other two.  If you'd like to read more about the current pageview definition, details are on the research page [7].
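
For anyone planning to consume the static files in bulk, here is a minimal sketch of reading one line of an hourly file.  It assumes each line has four space-separated fields (project code, page title, view count, byte count), as in the older pagecounts files; that layout is an assumption here, so please check the dataset's own documentation before relying on it.  The names parse_pageview_line and PageviewRecord and the sample line are made up for illustration.

```python
from typing import NamedTuple


class PageviewRecord(NamedTuple):
    """One row of an hourly file (hypothetical type, for illustration only)."""
    project: str
    title: str
    views: int


def parse_pageview_line(line: str) -> PageviewRecord:
    """Parse one line, ASSUMING the layout 'project title views bytes'."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        # Fail loudly instead of guessing when the layout is not as assumed.
        raise ValueError(f"unexpected field count: {line!r}")
    project, title, views, _byte_count = parts
    return PageviewRecord(project, title, int(views))


# Made-up sample line, not real data.
record = parse_pageview_line("en Main_Page 42 0")
print(record.project, record.title, record.views)
```

Raising an error on an unexpected field count is deliberate: if the real layout differs from the assumption above, the sketch stops rather than silently producing wrong counts.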

Deprecating: We are deprecating the pagecounts-raw and pagecounts-all-sites datasets in May 2016 (discussion here: https://phabricator.wikimedia.org/T130656 ).  This data suffers from many artifacts, lack of mobile data, and/or infrastructure problems, and so is not comparable to the new way we track pageviews.  It will remain here because we have historical data that may be useful, but it will not be maintained or updated beyond May 2016.

Clean-up: Analytics data on dumps was crammed into /other with unrelated datasets.  We made a new page to receive current and future datasets [3] and linked to it from /other and /.  Please let us know if anything there looks confusing or opaque and I'll be happy to clarify.

[2] http://dumps.wikimedia.org/other/pageviews