Hello world!
My name is Hal Triedman, and I’m a senior privacy engineer at WMF. I work to make data that WMF releases about reading, editing, and other on-wiki behavior safer, more granular, and more accessible to the world using differential privacy https://en.wikipedia.org/wiki/Differential_privacy.
Today I’m reaching out to share that WMF has released almost 8 years (from 1 July 2015 to present) of privatized pageview data https://diff.wikimedia.org/2023/06/21/new-dataset-uncovers-wikipedia-browsing-habits-while-protecting-users/, partitioned by country, project, and page. This data is significantly more granular than other datasets we release, and should help researchers to disambiguate both long- and short-term trends within languages on a country-by-country basis — several https://phabricator.wikimedia.org/T207171 long-standing requests https://phabricator.wikimedia.org/T267283 from Wikimedia communities.
Due to various technical factors, there are three distinct datasets:
- 1 July 2015 – 8 Feb 2017 https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/ / README https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/00_README.html (publishing threshold [1]: 3,500 pageviews)
- 9 Feb 2017 – 5 Feb 2023 https://analytics.wikimedia.org/published/datasets/country_project_page_historical/ / README https://analytics.wikimedia.org/published/datasets/country_project_page_historical/00_README.html (publishing threshold: 450 pageviews)
- 6 Feb 2023 – present https://analytics.wikimedia.org/published/datasets/country_project_page/ / README https://analytics.wikimedia.org/published/datasets/country_project_page/00_README.html (publishing threshold: 90 pageviews)
API access to this data should be coming in the next few months. In the interim, I’ve built an example Python notebook https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb illustrating how one might access the data in its current CSV format, as well as several different kinds of simple analyses that can be done with it.
I also want to invite the research community to join me for a brief demo of this project at the July Research Showcase https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase. In the meantime, please feel free to reach out with any questions on the project talk page https://meta.wikimedia.org/wiki/Talk:Differential_privacy.
For more information about WMF’s work on differential privacy more generally, see the differential privacy homepage on Meta-Wiki https://meta.wikimedia.org/wiki/Differential_privacy. And in the future, look for more announcements of privatized datasets on editor behavior, on-wiki search, CentralNotice impressions and clicks, and more.
Best,
Hal
[1] “Publishing threshold” is the minimum value a row must have in order to be published in the dataset.
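The thresholding rule in the footnote can be sketched roughly as follows. This is only an illustration: the rows, the field names (`page`, `views`), the dataset labels, and the helper `apply_publishing_threshold` are all hypothetical, and the sketch deliberately leaves out the noise-adding step of differential privacy. It is not WMF's actual pipeline.

```python
# Thresholds quoted in the announcement; the dictionary keys are made-up labels.
THRESHOLDS = {"2015-2017": 3500, "2017-2023": 450, "2023-present": 90}


def apply_publishing_threshold(rows, threshold):
    """Return only the rows whose value meets or exceeds the threshold."""
    return [row for row in rows if row["views"] >= threshold]


# Hypothetical rows; the real files may use different column names.
ROWS = [
    {"country": "FR", "project": "fr.wikipedia", "page": "A", "views": 5000},
    {"country": "FR", "project": "fr.wikipedia", "page": "B", "views": 500},
    {"country": "FR", "project": "fr.wikipedia", "page": "C", "views": 100},
    {"country": "FR", "project": "fr.wikipedia", "page": "D", "views": 40},
]

# For each threshold, list the pages that would survive publication.
published = {
    label: [row["page"] for row in apply_publishing_threshold(ROWS, threshold)]
    for label, threshold in THRESHOLDS.items()
}

for label, pages in published.items():
    print(label, pages)
```

The lower the threshold, the more rows survive, which is why the most recent dataset can be expected to cover more low-traffic pages than the older ones.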
🎉🎉🎉 Congrats on this release! Looking forward to using it in some projects 😀
--
Nate
_______________________________________________
Wiki-research-l mailing list -- wiki-research-l@lists.wikimedia.org
To unsubscribe send an email to wiki-research-l-leave@lists.wikimedia.org
Great dataset! This is amazing. I have no doubt that this will enable a lot of new research endeavors.
If I may make a suggestion: is it possible to also have a Wikidata ID for each row? That way we can more conveniently match the same concepts across languages at large scale...
Best,
Kai Zhu
Assistant Professor at Bocconi University
Hi Kai!
Thanks for this suggestion — I'll put it on the list of improvements to this dataset, and hopefully be able to put it into production in the next month or two. In the meantime, the example python notebook https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb I linked above has a subsection entitled "Example of joining page_ids and titles to wikidata QID" that shows how you can retrieve a set of QIDs manually for a given page ID or title. Hope this helps get you started!
Thanks again,
Hal
On Sun, Jun 25, 2023 at 4:30 PM Kai Zhu kaizhublcu@gmail.com wrote:
Great dataset! This is amazing. I have no doubt that this will enable a lot of new research endeavors.
If I may make a suggestion: is it possible to also have a Wikidata ID for each row? That way we can more conveniently match the same concepts across languages at large scale...
Best,
Kai Zhu
Assistant Professor at Bocconi University
Hi,
I hope this message finds you well. I'm writing to follow up on our previous discussions about enhancing the pageviews data file by adding a QID column. My collaborator and I have identified several use cases where the ability to match concepts across languages at a large scale is crucial. Given the volume of articles we're working with, relying on API calls for millions of them isn't feasible. Incorporating the QID column would significantly benefit not only our project but also a wide range of potential users who may face similar challenges.
Thank you for considering this request. We believe this addition could greatly improve the utility and accessibility of the data for various research and analysis purposes.
Best regards,
Kai Zhu
Assistant Professor
Bocconi University
On Mon, Jun 26, 2023 at 7:22 PM Hal Triedman htriedman@wikimedia.org wrote:
Hi Kai!
Thanks for this suggestion. I'll put it on the list of improvements to this dataset, and hopefully be able to put it into production in the next month or two. In the meantime, the example python notebook https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb I linked above has a subsection entitled "Example of joining page_ids and titles to wikidata QID" that shows how you can retrieve a set of QIDs manually for a given page ID or title. Hope this helps get you started!
Thanks again,
Hal
Hi Kai!
Thank you for this reminder — when this dataset was published, there wasn't a consistently-updated, stable page ID <--> QID table available internally. Now there is. I'll see what I can get done on this in the next week or two, and send any updates as soon as I can :)
Thanks again,
Hal
On Mon, Mar 4, 2024 at 10:04 AM Kai Zhu kaizhublcu@gmail.com wrote:
Hi,
I hope this message finds you well. I'm writing to follow up on our previous discussions about enhancing the pageviews data file by adding a QID column. My collaborator and I have identified several use cases where the ability to match concepts across languages at a large scale is crucial. Given the volume of articles we're working with, relying on API calls for millions of them isn't feasible. Incorporating the QID column would significantly benefit not only our project but also a wide range of potential users who may face similar challenges.
Thank you for considering this request. We believe this addition could greatly improve the utility and accessibility of the data for various research and analysis purposes.
Best regards,
Kai Zhu
Assistant Professor
Bocconi University
Hello Kai (and everyone else)!
I've updated these datasets (from 2017-present) to include an additional column with QID wherever possible. Please let me know if there are any issues or confusion about the datasets — I'm happy to get on calls, prioritize dataset improvements, or answer questions on this listserv :)
Happy analyses,
Hal
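As a rough illustration of what the QID column makes possible, the sketch below groups rows by QID so that the same concept can be compared across language editions without any API calls. The column names (`country`, `project`, `page_title`, `qid`, `views`), the inline sample data, and the helper `views_by_concept` are assumptions made for this example; check the dataset READMEs for the real schema. Rows with an empty QID are skipped, since the column is only filled in "wherever possible".

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed layout; not taken from the published files.
SAMPLE_CSV = """country,project,page_title,qid,views
DE,de.wikipedia,Erde,Q2,1200
DE,en.wikipedia,Earth,Q2,300
DE,de.wikipedia,Mond,Q405,800
DE,de.wikipedia,Beispielseite,,150
"""


def views_by_concept(csv_text):
    """Sum views per QID and record which projects contributed to each QID."""
    totals = defaultdict(int)
    projects = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        qid = row["qid"]
        if not qid:
            # No QID available for this row, so it cannot be matched.
            continue
        totals[qid] += int(row["views"])
        projects[qid].add(row["project"])
    return dict(totals), {qid: sorted(names) for qid, names in projects.items()}


totals, projects = views_by_concept(SAMPLE_CSV)
for qid in sorted(totals):
    print(qid, totals[qid], projects[qid])
```

Because the grouping key is the QID rather than the page title, "Erde" and "Earth" fall into the same bucket even though their titles differ.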
On Mon, Mar 4, 2024 at 11:47 AM Hal Triedman htriedman@wikimedia.org wrote:
Hi Kai!
Thank you for this reminder — when this dataset was published, there wasn't a consistently-updated, stable page ID <--> QID table available internally. Now there is. I'll see what I can get done on this in the next week or two, and send any updates as soon as I can :)
Thanks again,
Hal
Hi Hal,
Thank you so much! This is great. It makes matching concepts across languages at large scale much easier!
Best,
Kai
On Mon, Mar 18, 2024 at 7:29 PM Hal Triedman htriedman@wikimedia.org wrote:
Hello Kai (and everyone else)!
I've updated these datasets (from 2017-present) to include an additional column with QID wherever possible. Please let me know if there are any issues or confusion about the datasets — I'm happy to get on calls, prioritize dataset improvements, or answer questions on this listserv :)
Happy analyses,
Hal
On Mon, Mar 4, 2024 at 11:47 AM Hal Triedman htriedman@wikimedia.org wrote:
Hi Kai!
Thank you for this reminder — when this dataset was published, there wasn't a consistently-updated, stable page ID <--> QID table available internally. Now there is. I'll see what I can get done on this in the
next
week or two, and send any updates as soon as I can :)
Thanks again, Hal
On Mon, Mar 4, 2024 at 10:04 AM Kai Zhu kaizhublcu@gmail.com wrote:
Hi,
I hope this message finds you well. I'm writing to follow up on our previous discussions about enhancing the pageviews data file by adding a QID column. My collaborator and I have identified several use cases
where
the ability to match concepts across languages at a large scale is crucial. Given the volume of articles we're working with, relying on API calls
for
millions of them isn't feasible. Incorporating the QID column would significantly benefit not only our project but also a wide range of potential users who may face similar challenges.
Thank you for considering this request. We believe this addition could greatly improve the utility and accessibility of the data for various research and analysis purposes.
Best regards, Kai Zhu Assistant Professor Bocconi University
On Mon, Jun 26, 2023 at 7:22 PM Hal Triedman htriedman@wikimedia.org wrote:
Hi Kai!
Thanks for this suggestion. I'll put it on the list of improvements to this dataset, and hopefully be able to put it into production in the next month or two. In the meantime, the example python notebook <https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb> I linked above has a subsection entitled "Example of joining page_ids and titles to wikidata QID" that shows how you can retrieve a set of QIDs manually for a given page ID or title. Hope this helps get you started!
Thanks again, Hal
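One way to look up QIDs for a handful of titles is the MediaWiki Action API's pageprops query, sketched below. This is an assumption about a workable approach, not necessarily what the linked notebook does; the function names, the User-Agent string, and the sample response values are illustrative only.

```python
import json
import urllib.parse
import urllib.request


def qid_query_url(project_host, titles):
    """Build a MediaWiki API URL asking for the Wikidata item of each title."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": "|".join(titles),
        "format": "json",
    }
    return f"https://{project_host}/w/api.php?" + urllib.parse.urlencode(params)


def extract_qids(response):
    """Map page title -> QID (or None if absent) from a parsed API response."""
    pages = response.get("query", {}).get("pages", {})
    return {
        page["title"]: page.get("pageprops", {}).get("wikibase_item")
        for page in pages.values()
        if "title" in page
    }


def fetch_qids(project_host, titles):
    """Perform the request; needs network access and a descriptive User-Agent."""
    request = urllib.request.Request(
        qid_query_url(project_host, titles),
        headers={"User-Agent": "qid-lookup-example/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request) as reply:
        return extract_qids(json.load(reply))


# Offline check against a response of the expected shape (values illustrative).
sample_response = {
    "query": {
        "pages": {
            "8091": {
                "pageid": 8091,
                "title": "Douglas Adams",
                "pageprops": {"wikibase_item": "Q42"},
            }
        }
    }
}
sample_qids = extract_qids(sample_response)
```

Each request covers only a limited batch of titles, so this works for spot checks but not for millions of pages.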
On Sun, Jun 25, 2023 at 4:30 PM Kai Zhu kaizhublcu@gmail.com wrote:
Great dataset! This is amazing. I have no doubt that this will enable a lot of new research endeavors.
If I may have a suggestion: is it possible to also have wikidata id for each row? That way we can more conveniently match the same concepts across languages at large scale...
Best, Kai Zhu Assistant Professor at Bocconi University
On Wed, Jun 21, 2023 at 12:51 PM Hal Triedman <htriedman@wikimedia.org> wrote:
Hello world!
My name is Hal Triedman, and I’m a senior privacy engineer at WMF. I work to make data that WMF releases about reading, editing, and other on-wiki behavior safer, more granular, and more accessible to the world using differential privacy https://en.wikipedia.org/wiki/Differential_privacy.
Today I’m reaching out to share that WMF has released almost 8 years (from 1 July 2015 to present) of privatized pageview data <https://diff.wikimedia.org/2023/06/21/new-dataset-uncovers-wikipedia-browsin...>, partitioned by country, project, and page. This data is significantly more granular than other datasets we release, and should help researchers to disambiguate both long- and short-term trends within languages on a country-by-country basis: several https://phabricator.wikimedia.org/T207171 long-standing requests https://phabricator.wikimedia.org/T267283 from Wikimedia communities.
Due to various technical factors, there are three distinct datasets:
- 1 July 2015 – 8 Feb 2017 <https://analytics.wikimedia.org/published/datasets/country_project_page_hist...> / README <https://analytics.wikimedia.org/published/datasets/country_project_page_hist...> (publishing threshold [1]: 3,500 pageviews)
- 9 Feb 2017 – 5 Feb 2023 <https://analytics.wikimedia.org/published/datasets/country_project_page_hist...> / README <https://analytics.wikimedia.org/published/datasets/country_project_page_hist...> (publishing threshold: 450 pageviews)
- 6 Feb 2023 – present <https://analytics.wikimedia.org/published/datasets/country_project_page/> / README <https://analytics.wikimedia.org/published/datasets/country_project_page/00_R...> (publishing threshold: 90 pageviews)
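Because the three datasets use different publishing thresholds, any analysis that spans them needs to know which threshold applied on a given day. A minimal sketch of that lookup follows; the date ranges and thresholds are the ones listed above, while the names DATASETS and publishing_threshold are hypothetical helpers invented for this example.

```python
from datetime import date

# (first day, last day, publishing threshold) as listed in the announcement.
DATASETS = [
    (date(2015, 7, 1), date(2017, 2, 8), 3500),
    (date(2017, 2, 9), date(2023, 2, 5), 450),
    (date(2023, 2, 6), None, 90),  # open-ended: "present"
]


def publishing_threshold(day):
    """Return the minimum published row value for the dataset covering `day`."""
    for start, end, threshold in DATASETS:
        if day >= start and (end is None or day <= end):
            return threshold
    raise ValueError(f"{day} is before the first dataset starts")


threshold_2016 = publishing_threshold(date(2016, 1, 1))
threshold_2024 = publishing_threshold(date(2024, 3, 18))
```

A page that clears 90 views in 2023 but not 450 in 2022 would appear in one dataset and not the other, so an apparent jump in coverage at a dataset boundary may reflect the threshold change rather than reader behavior.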
API access to this data should be coming in the next few months. In the interim, I’ve built an example python notebook <https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb> illustrating how one might access the data in its current csv format, as well as several different kinds of simple analyses that can be done with it.
I also want to invite the research community to join me for a brief demo of this project at the July Research Showcase https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase. In the meantime, please feel free to reach out with any questions on the project talk page https://meta.wikimedia.org/wiki/Talk:Differential_privacy.
For more information about WMF’s work on differential privacy more generally, see the differential privacy homepage on meta https://meta.wikimedia.org/wiki/Differential_privacy. And in the future, look for more announcements of privatized datasets on editor behavior, on-wiki search, centralnotice impressions and clicks, and more.
Best,
Hal
[1] “Publishing threshold” is the minimum value of a row in the dataset in order to be published.