Hi,
I hope this message finds you well. I'm writing to follow up on our
previous discussions about enhancing the pageviews data file by adding a
QID column. My collaborator and I have identified several use cases where
the ability to match concepts across languages at a large scale is crucial.
Given the volume of articles we're working with, relying on API calls for
millions of them isn't feasible. Incorporating the QID column would
significantly benefit not only our project but also a wide range of
potential users who may face similar challenges.
Thank you for considering this request. We believe this addition could
greatly improve the utility and accessibility of the data for various
research and analysis purposes.
Best regards,
Kai Zhu
Assistant Professor
Bocconi University
On Mon, Jun 26, 2023 at 7:22 PM Hal Triedman <htriedman(a)wikimedia.org>
wrote:
Hi Kai!
Thanks for this suggestion — I'll put it on the list of improvements to
this dataset, and hopefully be able to put it into production in the next
month or two. In the meantime, the example python notebook
<https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb>
I linked above has a subsection entitled "Example of joining page_ids and
titles to wikidata QID" that shows how you can retrieve a set of QIDs
manually for a given page ID or title. Hope this helps get you started!
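
For readers without the notebook open, here is a minimal sketch of what such a join might look like. It is not the notebook's code. It assumes the MediaWiki Action API's prop=pageprops query with ppprop=wikibase_item, the English Wikipedia endpoint, and hypothetical row fields named page_id and views (the real CSV column names may differ). The sample response is hand-written with placeholder values so the block runs without network access.

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia; other projects use their own api.php endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_qid_query(titles):
    """Build a query URL asking for the wikibase_item page prop of each title."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": "|".join(titles),  # the API caps how many titles one request may carry
        "format": "json",
        "formatversion": "2",  # pages come back as a list rather than a dict
    }
    return f"{API_URL}?{urlencode(params)}"


def extract_qids(response):
    """Map page ID -> QID from a formatversion=2 query response."""
    qids = {}
    for page in response.get("query", {}).get("pages", []):
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid is not None and "pageid" in page:
            qids[page["pageid"]] = qid
    return qids


def attach_qids(rows, qids):
    """Add a 'qid' field to each row; rows without a match get None."""
    return [{**row, "qid": qids.get(row["page_id"])} for row in rows]


# Hand-written placeholder response (hypothetical IDs and QIDs, not real data).
sample_response = {
    "query": {
        "pages": [
            {"pageid": 101, "title": "Example A",
             "pageprops": {"wikibase_item": "Q1001"}},
            {"pageid": 102, "title": "Example B"},  # no Wikidata item linked
        ]
    }
}

# Hypothetical pageview rows; field names are assumptions for illustration.
rows = [
    {"page_id": 101, "views": 500},
    {"page_id": 102, "views": 120},
]

url = build_qid_query(["Example A", "Example B"])
joined = attach_qids(rows, extract_qids(sample_response))
```

In real use the URL would be fetched and the JSON body passed to extract_qids. For millions of pages the lookup would have to be batched or replaced by a mapping table built some other way, but the join step stays the same.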
Thanks again,
Hal
On Sun, Jun 25, 2023 at 4:30 PM Kai Zhu <kaizhublcu(a)gmail.com> wrote:
> Great dataset! This is amazing. I have no doubt that this will enable a lot
> of new research endeavors.
> If I may make a suggestion: is it possible to also include the Wikidata ID
> for each row? That way we can more conveniently match the same concepts
> across languages at large scale...
> Best,
> Kai Zhu
> Assistant Professor at Bocconi University
> On Wed, Jun 21, 2023 at 12:51 PM Hal Triedman <htriedman(a)wikimedia.org>
> wrote:
> > Hello world!
>
> > My name is Hal Triedman, and I’m a senior privacy engineer at WMF. I work
> > to make data that WMF releases about reading, editing, and other on-wiki
> > behavior safer, more granular, and more accessible to the world using
> > differential privacy <https://en.wikipedia.org/wiki/Differential_privacy>.
>
> > Today I’m reaching out to share that WMF has released almost 8 years (from
> > 1 July 2015 to present) of privatized pageview data
> > <https://diff.wikimedia.org/2023/06/21/new-dataset-uncovers-wikipedia-browsi…>,
> > partitioned by country, project, and page. This data is significantly more
> > granular than other datasets we release, and should help researchers to
> > disambiguate both long- and short-term trends within languages on a
> > country-by-country basis, addressing several
> > <https://phabricator.wikimedia.org/T207171> long-standing requests
> > <https://phabricator.wikimedia.org/T267283> from Wikimedia communities.
>
> > Due to various technical factors, there are three distinct datasets:
>
> > - 1 July 2015 – 8 Feb 2017
> >   <https://analytics.wikimedia.org/published/datasets/country_project_page_his…>
> >   / README
> >   <https://analytics.wikimedia.org/published/datasets/country_project_page_his…>
> >   (publishing threshold [1]: 3,500 pageviews)
> > - 9 Feb 2017 – 5 Feb 2023
> >   <https://analytics.wikimedia.org/published/datasets/country_project_page_his…>
> >   / README
> >   <https://analytics.wikimedia.org/published/datasets/country_project_page_his…>
> >   (publishing threshold: 450 pageviews)
> > - 6 Feb 2023 – present
> >   <https://analytics.wikimedia.org/published/datasets/country_project_page/>
> >   / README
> >   <https://analytics.wikimedia.org/published/datasets/country_project_page/00_…>
> >   (publishing threshold: 90 pageviews)
>
>
> > API access to this data should be coming in the next few months. In the
> > interim, I’ve built an example python notebook
> > <https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb>
> > illustrating how one might access the data in its current csv format, as
> > well as several different kinds of simple analyses that can be done with it.
>
> > I also want to invite the research community to join me for a brief demo of
> > this project at the July Research Showcase
> > <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase>. In the
> > meantime, please feel free to reach out with any questions on the project
> > talk page <https://meta.wikimedia.org/wiki/Talk:Differential_privacy>.
>
> > For more information about WMF’s work on differential privacy more
> > generally, see the differential privacy homepage on meta
> > <https://meta.wikimedia.org/wiki/Differential_privacy>. And in the future,
> > look for more announcements of privatized datasets on editor behavior,
> > on-wiki search, centralnotice impressions and clicks, and more.
>
> > Best,
>
> > Hal
>
> > [1] “Publishing threshold” is the minimum value of a row in the dataset in
> > order to be published.
> > _______________________________________________
> > Wiki-research-l mailing list -- wiki-research-l(a)lists.wikimedia.org
> > To unsubscribe send an email to wiki-research-l-leave(a)lists.wikimedia.org