Hello world!


My name is Hal Triedman, and I’m a senior privacy engineer at WMF. I work to make data that WMF releases about reading, editing, and other on-wiki behavior safer, more granular, and more accessible to the world using differential privacy.


Today I’m reaching out to share that WMF has released almost 8 years (from 1 July 2015 to present) of privatized pageview data, partitioned by country, project, and page. This data is significantly more granular than other datasets we release, and should help researchers disambiguate both long- and short-term trends within languages on a country-by-country basis, which addresses several long-standing requests from Wikimedia communities.


Due to various technical factors, there are three distinct datasets:


API access to this data should be coming in the next few months. In the interim, I’ve built an example Python notebook illustrating how one might access the data in its current CSV format, along with several kinds of simple analyses that can be done with it.
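
As a rough illustration of the kind of simple analysis the CSV format allows, here is a minimal sketch that sums views per country and project. It is not taken from the notebook: the column names (country, project, page, views) and the sample rows are made up for illustration, so the real files may well look different.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows. The column names (country, project, page, views)
# are assumptions for illustration, not the confirmed schema of the release.
SAMPLE_CSV = """country,project,page,views
FR,fr.wikipedia,Paris,1200
FR,fr.wikipedia,Lyon,300
DE,de.wikipedia,Berlin,900
FR,en.wikipedia,Paris,450
"""


def views_by_country_and_project(csv_file):
    """Sum the views column for each (country, project) pair."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[(row["country"], row["project"])] += int(row["views"])
    return dict(totals)


# In practice, csv_file would be an opened file from the release
# instead of the in-memory sample used here.
totals = views_by_country_and_project(io.StringIO(SAMPLE_CSV))
for (country, project), views in sorted(totals.items()):
    print(country, project, views)
```

Because the published values are privatized, sums like these are approximate and should be read as noisy estimates, not exact counts.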


I also want to invite the analytics community to join me for a brief demo of this project at the July Research Showcase. In the meantime, please feel free to reach out with any questions on the project talk page.


For more information about WMF’s work on differential privacy more generally, see the differential privacy homepage on Meta. And in the future, look for more announcements of privatized datasets on editor behavior, on-wiki search, CentralNotice impressions and clicks, and more.


Best,

Hal


[1] “Publishing threshold” is the minimum value a row in the dataset must have in order to be published.