lists.wikimedia.org
Analytics-announce
June 2023
analytics-announce@lists.wikimedia.org
New private, granular pageview dataset
by Hal Triedman
21 Jun '23
Hello world!

My name is Hal Triedman, and I’m a senior privacy engineer at WMF. I work to make data that WMF releases about reading, editing, and other on-wiki behavior safer, more granular, and more accessible to the world using differential privacy <https://en.wikipedia.org/wiki/Differential_privacy>.

Today I’m reaching out to share that WMF has released almost 8 years (from 1 July 2015 to present) of privatized pageview data <https://diff.wikimedia.org/2023/06/21/new-dataset-uncovers-wikipedia-browsi…>, partitioned by country, project, and page. This data is significantly more granular than other datasets we release, and should help researchers to disambiguate both long- and short-term trends within languages on a country-by-country basis, which addresses several <https://phabricator.wikimedia.org/T207171> long-standing requests <https://phabricator.wikimedia.org/T267283> from Wikimedia communities.

Due to various technical factors, there are three distinct datasets:

- 1 July 2015 – 8 Feb 2017 <https://analytics.wikimedia.org/published/datasets/country_project_page_his…> / README <https://analytics.wikimedia.org/published/datasets/country_project_page_his…> (publishing threshold [1]: 3,500 pageviews)
- 9 Feb 2017 – 5 Feb 2023 <https://analytics.wikimedia.org/published/datasets/country_project_page_his…> / README <https://analytics.wikimedia.org/published/datasets/country_project_page_his…> (publishing threshold: 450 pageviews)
- 6 Feb 2023 – present <https://analytics.wikimedia.org/published/datasets/country_project_page/> / README <https://analytics.wikimedia.org/published/datasets/country_project_page/00_…> (publishing threshold: 90 pageviews)

API access to this data should be coming in the next few months. In the interim, I’ve built an example Python notebook <https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb> illustrating how one might access the data in its current CSV format, as well as several different kinds of simple analyses that can be done with it.

I also want to invite the analytics community to join me for a brief demo of this project at the July Research Showcase <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase>. In the meantime, please feel free to reach out with any questions on the project talk page <https://meta.wikimedia.org/wiki/Talk:Differential_privacy>.

For more information about WMF’s work on differential privacy more generally, see the differential privacy homepage on meta <https://meta.wikimedia.org/wiki/Differential_privacy>. And in the future, look for more announcements of privatized datasets on editor behavior, on-wiki search, CentralNotice impressions and clicks, and more.

Best,
Hal

[1] “Publishing threshold” is the minimum value that a row in the dataset must have in order to be published.
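As a rough illustration of the three date ranges and thresholds listed in the message, and of the kind of per-country aggregation the CSV files allow, here is a minimal Python sketch. It is not the linked notebook's code. The column names (`country`, `project`, `page`, `views`), the presence of a header row, and the sample rows are all assumptions made for this example; the READMEs describe the real file layout.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Date ranges and publishing thresholds as stated in the announcement.
# An end date of None means the range is open-ended ("present").
THRESHOLDS = [
    (date(2015, 7, 1), date(2017, 2, 8), 3500),
    (date(2017, 2, 9), date(2023, 2, 5), 450),
    (date(2023, 2, 6), None, 90),
]


def publishing_threshold(day):
    """Return the publishing threshold that applies to a given day."""
    for start, end, threshold in THRESHOLDS:
        if day >= start and (end is None or day <= end):
            return threshold
    raise ValueError(f"no dataset covers {day}")


def views_by_country(csv_file, project):
    """Sum view counts per country for one project.

    Assumes a header row with the hypothetical columns
    country, project, page, views (not confirmed by the announcement).
    """
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        if row["project"] == project:
            totals[row["country"]] += int(row["views"])
    return dict(totals)


# Synthetic rows for illustration only; this is not real pageview data.
SAMPLE_CSV = (
    "country,project,page,views\n"
    "FR,fr.wikipedia,Paris,1200\n"
    "FR,fr.wikipedia,Lyon,300\n"
    "CA,fr.wikipedia,Paris,150\n"
    "CA,en.wikipedia,Paris,95\n"
)

if __name__ == "__main__":
    print(publishing_threshold(date(2020, 1, 1)))
    print(views_by_country(io.StringIO(SAMPLE_CSV), "fr.wikipedia"))
```

Because rows below the threshold are never published, a missing (country, project, page) combination means "below the threshold for that period", not "zero views". Sums like the one above are therefore lower bounds on the published, privatized counts.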