Hello world!
My name is Hal Triedman, and I’m a senior privacy engineer at WMF. I work
to make data that WMF releases about reading, editing, and other on-wiki
behavior safer, more granular, and more accessible to the world using
differential
privacy <https://en.wikipedia.org/wiki/Differential_privacy>.
Today I’m reaching out to share that WMF has released almost 8 years (from
1 July 2015 to present) of privatized pageview data
<https://diff.wikimedia.org/2023/06/21/new-dataset-uncovers-wikipedia-browsing-habits-while-protecting-users/>,
partitioned by country, project, and page. This data is significantly more
granular than other datasets we release, and should help researchers to
disambiguate both long- and short-term trends within languages on a
country-by-country basis — several
<https://phabricator.wikimedia.org/T207171> long-standing requests
<https://phabricator.wikimedia.org/T267283> from Wikimedia communities.
Due to various technical factors, there are three distinct datasets:
- 1 July 2015 – 8 Feb 2017
  <https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/>
  / README
  <https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/00_README.html>
  (publishing threshold [1]: 3,500 pageviews)
- 9 Feb 2017 – 5 Feb 2023
  <https://analytics.wikimedia.org/published/datasets/country_project_page_historical/>
  / README
  <https://analytics.wikimedia.org/published/datasets/country_project_page_historical/00_README.html>
  (publishing threshold: 450 pageviews)
- 6 Feb 2023 – present
  <https://analytics.wikimedia.org/published/datasets/country_project_page/>
  / README
  <https://analytics.wikimedia.org/published/datasets/country_project_page/00_README.html>
  (publishing threshold: 90 pageviews)
API access to this data should be coming in the next few months. In the
interim, I’ve built an example Python notebook
<https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb>
illustrating how one might access the data in its current CSV format, along
with several kinds of simple analyses that can be done with it.
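For a rough idea of what working with the CSV files can look like outside the notebook, here is a minimal sketch using only the Python standard library. It is not taken from the notebook: the column names and order (country, project, page, views), the absence of a header row, the helper name top_pages_by_country, and the sample rows are all assumptions made up for illustration. Please check each dataset's README for the actual schema before adapting it.

```python
import csv
import io
from collections import defaultdict

# ASSUMPTION: column names and order are hypothetical, and the file is
# assumed to have no header row. Consult the dataset README for the real schema.
FIELDNAMES = ["country", "project", "page", "views"]


def top_pages_by_country(csv_text, project, n=3):
    """Return {country: [(page, views), ...]} for one project, most-viewed first."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=FIELDNAMES)
    by_country = defaultdict(list)
    for row in reader:
        if row["project"] == project:
            by_country[row["country"]].append((row["page"], int(row["views"])))
    # Keep only the n most-viewed pages for each country.
    return {
        country: sorted(pages, key=lambda item: item[1], reverse=True)[:n]
        for country, pages in by_country.items()
    }


# Made-up rows for illustration only; these are not real pageview counts.
SAMPLE = """\
FR,fr.wikipedia,Paris,1200
FR,fr.wikipedia,Lyon,300
FR,en.wikipedia,Paris,500
DE,fr.wikipedia,Paris,150
"""

result = top_pages_by_country(SAMPLE, "fr.wikipedia", n=1)
print(result)
```

To run this against a downloaded file, read its contents into a string (or pass the open file object straight to csv.DictReader) and adjust FIELDNAMES and the delimiter to match the README.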
I also want to invite the analytics community to join me for a brief demo
of this project at the July Research Showcase
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase>. In the
meantime, please feel free to reach out with any questions on the project talk
page <https://meta.wikimedia.org/wiki/Talk:Differential_privacy>.
For more information about WMF’s differential privacy work in general,
see the differential privacy homepage on Meta
<https://meta.wikimedia.org/wiki/Differential_privacy>. And in the future,
look for more announcements of privatized datasets on editor behavior,
on-wiki search, CentralNotice impressions and clicks, and more.
Best,
Hal
[1] “Publishing threshold” is the minimum value a row in the dataset must
have in order to be published.