Hello world!
My name is Hal Triedman, and I’m a senior privacy engineer at WMF. I work to make data that WMF releases about reading, editing, and other on-wiki behavior safer, more granular, and more accessible to the world using differential privacy https://en.wikipedia.org/wiki/Differential_privacy.
Today I’m reaching out to share that WMF has released almost 8 years (from 1 July 2015 to present) of privatized pageview data https://diff.wikimedia.org/2023/06/21/new-dataset-uncovers-wikipedia-browsing-habits-while-protecting-users/, partitioned by country, project, and page. This data is significantly more granular than other datasets we release, and should help researchers disambiguate both long- and short-term trends within languages on a country-by-country basis, addressing several https://phabricator.wikimedia.org/T207171 long-standing requests https://phabricator.wikimedia.org/T267283 from Wikimedia communities.
Due to various technical factors, there are three distinct datasets:
- 1 July 2015 – 8 Feb 2017 https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/ / README https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/00_README.html (publishing threshold [1]: 3,500 pageviews)
- 9 Feb 2017 – 5 Feb 2023 https://analytics.wikimedia.org/published/datasets/country_project_page_historical/ / README https://analytics.wikimedia.org/published/datasets/country_project_page_historical/00_README.html (publishing threshold: 450 pageviews)
- 6 Feb 2023 – present https://analytics.wikimedia.org/published/datasets/country_project_page/ / README https://analytics.wikimedia.org/published/datasets/country_project_page/00_README.html (publishing threshold: 90 pageviews)
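For anyone scripting against all three datasets, here is a minimal Python sketch of how one might pick the right dataset for a given day. The date ranges, base URLs, and publishing thresholds are copied from the list above; the names DATASETS and dataset_for are just illustrative, not part of any WMF tooling.

```python
from datetime import date

# (first day, last day or None for "present", base URL, publishing threshold)
# All values are copied from the dataset list in this email.
DATASETS = [
    (date(2015, 7, 1), date(2017, 2, 8),
     "https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/",
     3500),
    (date(2017, 2, 9), date(2023, 2, 5),
     "https://analytics.wikimedia.org/published/datasets/country_project_page_historical/",
     450),
    (date(2023, 2, 6), None,
     "https://analytics.wikimedia.org/published/datasets/country_project_page/",
     90),
]


def dataset_for(day):
    """Return (base_url, publishing_threshold) for the dataset covering `day`."""
    for start, end, url, threshold in DATASETS:
        if day >= start and (end is None or day <= end):
            return url, threshold
    raise ValueError(f"no dataset covers {day.isoformat()}")
```

For example, dataset_for(date(2020, 1, 1)) returns the country_project_page_historical URL together with a threshold of 450. The sketch only selects a base URL; how files are laid out underneath each one is described in the corresponding README.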
API access to this data should be coming in the next few months. In the interim, I’ve built an example Python notebook https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb illustrating how one might access the data in its current CSV format, as well as several different kinds of simple analyses that can be done with it.
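Separately from the notebook, here is a rough sketch of one simple analysis (total views per country) using only Python's standard library. Please treat the column names "country" and "views" as assumptions for illustration; the real schema is documented in each dataset's README. The sample rows are made-up placeholders, not actual published values.

```python
import csv
import io
from collections import Counter


def views_by_country(csv_file, country_col="country", views_col="views"):
    """Sum the view counts per country from an open CSV file object.

    The default column names are hypothetical; pass the names given in
    the dataset README.
    """
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[row[country_col]] += int(row[views_col])
    return totals


# Made-up placeholder rows, used only to show the expected shape of the input.
sample = io.StringIO(
    "country,project,page,views\n"
    "AA,xx.wikipedia,Example_page_1,120\n"
    "AA,xx.wikipedia,Example_page_2,200\n"
    "BB,xx.wikipedia,Example_page_1,95\n"
)
totals = views_by_country(sample)
```

Because rows below the publishing threshold are not released, per-country totals computed this way cover only the published rows.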
I also want to invite the analytics community to join me for a brief demo of this project at the July Research Showcase https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase. In the meantime, please feel free to reach out with any questions on the project talk page https://meta.wikimedia.org/wiki/Talk:Differential_privacy.
For more information about WMF’s work on differential privacy more generally, see the differential privacy homepage on Meta https://meta.wikimedia.org/wiki/Differential_privacy. And in the future, look for more announcements of privatized datasets on editor behavior, on-wiki search, CentralNotice impressions and clicks, and more.
Best,
Hal
[1] “Publishing threshold” is the minimum value a row in the dataset must have in order to be published.
analytics-announce@lists.wikimedia.org