Great dataset! This is amazing. I have no doubt that it will enable a lot
of new research.
If I may make a suggestion: would it be possible to also include the
Wikidata ID for each row? That way we could more conveniently match the
same concepts across languages at large scale.
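A minimal sketch of the kind of cross-language matching this would enable. It assumes a hypothetical `wikidata_id` column (not present in the current release); the other field names and all view counts are likewise made up for illustration:

```python
from collections import defaultdict

# Hypothetical rows: the field names and the `wikidata_id` column are
# assumptions for illustration, and the view counts are invented.
rows = [
    {"project": "en.wikipedia", "page_title": "Paris", "wikidata_id": "Q90", "views": 300},
    {"project": "fr.wikipedia", "page_title": "Paris", "wikidata_id": "Q90", "views": 1200},
    {"project": "it.wikipedia", "page_title": "Parigi", "wikidata_id": "Q90", "views": 200},
    {"project": "it.wikipedia", "page_title": "Roma", "wikidata_id": "Q220", "views": 950},
]


def views_by_concept(rows):
    """Sum views per Wikidata item, regardless of language edition."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["wikidata_id"]] += row["views"]
    return dict(totals)


concept_totals = views_by_concept(rows)
print(concept_totals)
```

Because the key is the Wikidata item rather than the page title, "Paris" and "Parigi" fall into the same group without any title translation step.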
Best,
Kai Zhu
Assistant Professor at Bocconi University
On Wed, Jun 21, 2023 at 12:51 PM Hal Triedman <htriedman(a)wikimedia.org>
wrote:
Hello world!
My name is Hal Triedman, and I’m a senior privacy engineer at WMF. I work
to make data that WMF releases about reading, editing, and other on-wiki
behavior safer, more granular, and more accessible to the world using
differential privacy <https://en.wikipedia.org/wiki/Differential_privacy>.
Today I’m reaching out to share that WMF has released almost 8 years (from
1 July 2015 to present) of privatized pageview data
<https://diff.wikimedia.org/2023/06/21/new-dataset-uncovers-wikipedia-browsi…>,
partitioned by country, project, and page. This data is significantly more
granular than other datasets we release, and should help researchers to
disambiguate both long- and short-term trends within languages on a
country-by-country basis. This addresses several
<https://phabricator.wikimedia.org/T207171> long-standing requests
<https://phabricator.wikimedia.org/T267283> from Wikimedia communities.
Due to various technical factors, there are three distinct datasets:
- 1 July 2015 – 8 Feb 2017
  <https://analytics.wikimedia.org/published/datasets/country_project_page_his…>
  / README
  <https://analytics.wikimedia.org/published/datasets/country_project_page_his…>
  (publishing threshold [1]: 3,500 pageviews)
- 9 Feb 2017 – 5 Feb 2023
  <https://analytics.wikimedia.org/published/datasets/country_project_page_his…>
  / README
  <https://analytics.wikimedia.org/published/datasets/country_project_page_his…>
  (publishing threshold: 450 pageviews)
- 6 Feb 2023 – present
  <https://analytics.wikimedia.org/published/datasets/country_project_page/>
  / README
  <https://analytics.wikimedia.org/published/datasets/country_project_page/00_…>
  (publishing threshold: 90 pageviews)
API access to this data should be coming in the next few months. In the
interim, I’ve built an example Python notebook
<https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb>
illustrating how one might access the data in its current CSV format, as
well as several different kinds of simple analyses that can be done with
it.
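As a rough illustration of that kind of access (not taken from the notebook itself), here is a minimal sketch that parses CSV text and sums views per country. The column names `country`, `project`, `page_title`, and `views`, the comma delimiter, and the sample rows are all assumptions; the actual schema is described in each dataset's README and may differ:

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for one published file. The real column
# names and delimiter are documented in the dataset READMEs and may differ.
SAMPLE = """country,project,page_title,views
France,fr.wikipedia,Paris,1200
France,en.wikipedia,Paris,300
Italy,it.wikipedia,Roma,950
Italy,it.wikipedia,Milano,450
"""


def total_views_by_country(csv_text):
    """Sum the (assumed) `views` column for each country."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["country"]] += int(row["views"])
    return dict(totals)


totals = total_views_by_country(SAMPLE)
print(totals)
```

For a real file, the same function could be fed the text of a downloaded file instead of `SAMPLE`, after adjusting the column names to match the README.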
I also want to invite the research community to join me for a brief demo of
this project at the July Research Showcase
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase>. In the
meantime, please feel free to reach out with any questions on the project
talk page <https://meta.wikimedia.org/wiki/Talk:Differential_privacy>.
For more information about WMF’s broader work on differential privacy,
see the differential privacy homepage on Meta
<https://meta.wikimedia.org/wiki/Differential_privacy>. In the future,
look for more announcements of privatized datasets on editor behavior,
on-wiki search, CentralNotice impressions and clicks, and more.
Best,
Hal
[1] “Publishing threshold” is the minimum value a row in the dataset must
have in order to be published.
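A minimal sketch of what such a threshold does, using made-up rows. The function name and data layout are hypothetical, and this is not a description of WMF's actual release pipeline:

```python
def apply_publishing_threshold(rows, threshold):
    """Keep only the (key, value) rows whose value is at least `threshold`."""
    return [(key, value) for key, value in rows if value >= threshold]


# Invented example values; 90 mirrors the threshold quoted for the newest dataset.
rows = [("Paris", 120), ("Lyon", 90), ("Nice", 42)]
published = apply_publishing_threshold(rows, threshold=90)
print(published)
```

A row exactly at the threshold is kept, since the threshold is described as a minimum value.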
_______________________________________________
Wiki-research-l mailing list -- wiki-research-l(a)lists.wikimedia.org