Dear Wikimedia analytics team,
We are three master's students from Vrije Universiteit Amsterdam (VU) and the University of Amsterdam (UvA) working on a large-scale data engineering project on detecting DDoS attacks on Wikipedia. We analyse page views and traffic and try to distinguish, for example, DDoS attacks from trending topics.
For this project, we need a lot of data. We found two sources of public data: Pageview complete (https://dumps.wikimedia.org/other/pageview_complete/) and the filtered version thereof (https://dumps.wikimedia.org/other/pageviews/). While these dumps are already quite useful, we also found a dataset with even more information (https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_ho…). In particular, it contains the country a pageview came from and the referer, both of which could be very useful for our project.
According to the above page, this dataset has been private since 2018. We would like to ask whether we could get access to this dataset for our research, or to any other extended version of the public dumps that would enable more in-depth research. We have our own cluster, so we could work on a copy of the data. Moreover, we would like to share our project and all our results with you to help contribute to your security measures.
Best regards,
Charel Felten, Gilles Magalhaes and Aleksander Janczewski
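For the public hourly dumps mentioned above, a minimal sketch of how the per-page counts might be read is given here. It assumes each line has four space-separated fields (domain code, page title, view count, response size), which is how the hourly pageviews files are commonly described; the names PageviewRecord, parse_pageview_line and views_per_page are hypothetical helpers for illustration, not part of any Wikimedia tooling.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class PageviewRecord(NamedTuple):
    domain_code: str  # e.g. "en" for English Wikipedia (assumed layout)
    page_title: str   # title with underscores instead of spaces
    views: int        # number of views in the hour covered by the file


def parse_pageview_line(line: str) -> Optional[PageviewRecord]:
    """Parse one dump line; return None if it does not match the assumed layout."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None
    domain_code, page_title, views, _response_size = parts
    try:
        return PageviewRecord(domain_code, page_title, int(views))
    except ValueError:
        # The view count was not an integer, so skip the line.
        return None


def views_per_page(lines: Iterable[str], domain_code: str = "en") -> Counter:
    """Sum the views per page title for a single domain code."""
    totals: Counter = Counter()
    for line in lines:
        record = parse_pageview_line(line)
        if record is not None and record.domain_code == domain_code:
            totals[record.page_title] += record.views
    return totals


# Made-up sample lines, for illustration only.
sample = [
    "en Main_Page 100 0",
    "en Main_Page 50 0",
    "de Wikipedia 7 0",
    "malformed line",
    "en Foo notanumber 0",
]
totals = views_per_page(sample)
```

Per-page hourly totals of this kind could then serve as the time series on which any spike analysis is run; the sketch deliberately stops at parsing and aggregation.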
Hi all,
Join the Research Team at the Wikimedia Foundation [1] for their monthly
Office hours next Tuesday, 2021-10-05, at 16:00-17:00 UTC (9am PT/6pm
CEST).
To participate, join the video-call via this link [2]. There is no set
agenda - feel free to add your item to the list of topics in the etherpad
[3] (You can do this after you join the meeting, too.), otherwise you are
welcome to also just hang out. More detailed information (e.g. about how to
attend) can be found here [4].
Through these office hours, we aim to make ourselves more available to
answer some of the research related questions that you as Wikimedia
volunteer editors, organizers, affiliates, staff, and researchers face in
your projects and initiatives. Some example cases we hope to be able to
support you in:
- You have a specific research related question that you suspect you
should be able to answer with the publicly available data and you don’t
know how to find an answer for it, or you just need some more help with it.
For example, how can I compute the ratio of anonymous to registered editors
in my wiki?
- You run into repetitive or very manual work as part of your Wikimedia
contributions and you wish to find out if there are ways to use machines to
improve your workflows. These types of conversations can sometimes be
harder to find an answer for during an office hour, however, discussing
them can help us understand your challenges better and we may find ways to
work with each other to support you in addressing it in the future.
- You want to learn what the Research team at the Wikimedia Foundation
does and how we can potentially support you. Specifically for affiliates:
if you are interested in building relationships with the academic
institutions in your country, we would love to talk with you and learn
more. We have a series of programs that aim to expand the network of
Wikimedia researchers globally and we would love to collaborate with those
of you interested more closely in this space.
- You want to talk with us about one of our existing programs [5].
Hope to see many of you,
Emily on behalf of the WMF Research Team
[1] https://research.wikimedia.org
[2] https://meet.jit.si/WMF-Research-Office-Hours
[3] https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours
[4] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
[5] https://research.wikimedia.org/projects.html
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation
TL;DR I would like to access Wikipedia article metadata (such as number of edits, pageviews, etc.). I need to access a large volume of instances in order to train and maintain an online classifier, and the API does not seem sustainable for this. I was wondering which tool is most appropriate for this task.
Hello everyone,
This is my first time posting on this mailing list, so I will be happy to receive feedback on how to better interact with the community :)
I crossposted this message to Wiki-research-l as well.
I am trying to access Wikipedia metadata in a streaming manner that is sustainable in terms of time and resources. By metadata I mean many of the items that can be found in the statistics of a wiki article, such as edits, the list of editors, page views, etc.
I would like to do this for an online classifier type of setup: retrieve the data from a large number of wiki pages at regular intervals and use it as input for predictions.
I tried to use the Wiki API; however, it is expensive in time and resources, both for me and for Wikipedia.
My preferred choice now would be to query the specific tables in the Wikipedia database, in the same way this is done through the Quarry tool. The problem with Quarry is that I would like to build a standalone script, without having to depend on a user interface like Quarry. Do you think that this is possible? I am still fairly new to all of this and I don’t know exactly which is the best direction.
I saw [1] that I could access the wiki replicas through both Toolforge and PAWS; however, I didn’t understand which one would serve me better. Could I ask you for some feedback?
Also, as far as I understood [2], directly accessing the DB through Hive is too technical for what I need, right? Especially because it seems that I would need an account with production shell access, and I honestly don’t think that I would be granted access to it. Also, I am not interested in accessing sensitive or private data.
My last resort would be parsing the analytics dumps; however, this seems a less organic way of retrieving and cleaning the data. It would also be strongly decentralised and dependent on a physical machine, unless I upload the cleaned data online every time.
Sorry for this long message, but I thought it was better to give you a clearer picture (hoping this is clear enough). If you could give me even a hint, it would be highly appreciated.
Best,
Cristina
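One way the API cost described above might be reduced, sketched under assumptions: the MediaWiki Action API accepts several titles in one request when they are joined with "|", with a limit of 50 titles per request for regular clients. The snippet below only builds the request URLs and does not send anything. The English Wikipedia endpoint, the choice of prop=info, and the helper names chunked and build_batch_urls are illustrative assumptions, not a recommendation from this thread.

```python
from typing import Iterator, List, Sequence
from urllib.parse import urlencode

# Assumed endpoint (English Wikipedia); other wikis use their own host name.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
# Title limit per request for regular (non-bot) clients.
MAX_TITLES_PER_REQUEST = 50


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` containing at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def build_batch_urls(titles: Sequence[str],
                     batch_size: int = MAX_TITLES_PER_REQUEST) -> List[str]:
    """Build one query URL per batch of titles instead of one URL per title."""
    urls = []
    for batch in chunked(titles, batch_size):
        params = {
            "action": "query",
            "prop": "info",             # basic page metadata
            "titles": "|".join(batch),  # several titles in a single request
            "format": "json",
        }
        urls.append(f"{API_ENDPOINT}?{urlencode(params)}")
    return urls


# Hypothetical page titles, for illustration only.
titles = [f"Page_{i}" for i in range(120)]
urls = build_batch_urls(titles)
```

With 120 titles this produces three request URLs rather than 120, which shows the effect of batching; it does not address whether the replicas or the dumps would be a better fit.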
Hello all,
The September Wikimedia Research Showcase will be on September 15 at 16:30
UTC (9:30am PT/ 12:30pm ET/ 18:30 CEST). The theme will be "socialization
on Wikipedia" with speakers Rosta Farzan and J. Nathan Matias.
Livestream: https://www.youtube.com/watch?v=YVqabVvLIZU
Talk 1
Speaker: Rosta Farzan (School of Computing and Information, University of
Pittsburgh)
Title: Unlocking the Wikipedia clubhouse to newcomers: results from two
studies
Abstract: It is no news to any of us that success of online production
communities such as Wikipedia highly relies on a continuous stream of
newcomers to replace the inevitable high turnover and to bring on board new
sources of ideas and workforce. However, these communities have been
struggling with attracting newcomers, especially from a diverse population
of users, and with retaining those newcomers. In this talk, I will present
two different approaches to engaging new editors in Wikipedia: (1)
newcomers joining through the Wiki Ed program, an online program in which
college students edit Wikipedia articles as class assignments; (2)
newcomers joining through a Wikipedia Art+Feminism edit-a-thon. I present
how each approach incorporated techniques for engaging newcomers and how
they succeeded in attracting and retaining newcomers.
More information:
- Bring on Board New Enthusiasts! A Case Study of Impact of Wikipedia
Art + Feminism Edit-A-Thon Events on Newcomers
<https://link.springer.com/chapter/10.1007/978-3-319-47880-7_2>, SocInfo
2016 (pdf
<http://saviaga.com/wp-content/uploads/2016/06/socinfo_ediathons.pdf>)
- Successful Online Socialization: Lessons from the Wikipedia Education
Program <https://dl.acm.org/doi/abs/10.1145/3392857>, CSCW 2020 (pdf
<https://www.cc.gatech.edu/~dyang888/docs/cscw_li_2020_wiki.pdf>)
Talk 2
Speaker: J. Nathan Matias <http://natematias.com/> (Citizens and Technology
Lab <http://citizensandtech.org/>, Cornell University Departments of
Communication and Information Science)
Title: The Effect of Receiving Appreciation on Wikipedias. A Community
Co-Designed Field Experiment
Abstract: Can saying “thank you” make online communities stronger & more
inclusive? Or does thanking others for their voluntary efforts have little
effect? To ask this question, the Citizens and Technology Lab (CAT Lab)
organized 344 volunteers to send thanks to Wikipedia contributors across
the Arabic, German, Polish, and Persian languages. We then observed the
behavior of 15,558 newcomers and experienced contributors to Wikipedia. On
average, we found that organizing volunteers to thank others increases
two-week retention of newcomers and experienced accounts. It also caused
people to send more thanks to others. This study was a field experiment, a
randomized trial that sent thanks to some people and not to others. These
experiments can help answer questions about the impact of community
practices and platform design. But they can sometimes face community
mistrust, especially when researchers conduct them without community
consent. In this talk, learn more about CAT Lab's approach to community-led
research and discuss open questions about best practices.
More information:
- Volunteers Thanked Thousands of Wikipedia Editors to Learn the Effects
of Receiving Thanks
<https://citizensandtech.org/2020/06/effects-of-saying-thanks-on-wikipedia/>,
blogpost (in EN, DE, AR, PL, FA) <https://osf.io/ueq5f/>
- The Diffusion and Influence of Gratitude Expressions in Large-Scale
Cooperation: A Field Experiment in Four Knowledge Networks
<https://osf.io/ueq5f/>, paper preprint
More information: https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
--
Janna Layton (she/her)
Administrative Associate - Product & Technology
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
Join the Research Team at the Wikimedia Foundation [1] for their monthly
Office hours next Tuesday, 2021-09-07, at 16:00-17:00 UTC (9am PT/6pm
CEST).
To participate, join the video-call via this link [2]. There is no set
agenda - feel free to add your item to the list of topics in the etherpad
[3] (You can do this after you join the meeting, too.), otherwise you are
welcome to also just hang out. More detailed information (e.g. about how to
attend) can be found here [4].
Through these office hours, we aim to make ourselves more available to
answer some of the research related questions that you as Wikimedia
volunteer editors, organizers, affiliates, staff, and researchers face in
your projects and initiatives. Some example cases we hope to be able to
support you in:
- You have a specific research related question that you suspect you
should be able to answer with the publicly available data and you don’t
know how to find an answer for it, or you just need some more help with it.
For example, how can I compute the ratio of anonymous to registered editors
in my wiki?
- You run into repetitive or very manual work as part of your Wikimedia
contributions and you wish to find out if there are ways to use machines to
improve your workflows. These types of conversations can sometimes be
harder to find an answer for during an office hour, however, discussing
them can help us understand your challenges better and we may find ways to
work with each other to support you in addressing it in the future.
- You want to learn what the Research team at the Wikimedia Foundation
does and how we can potentially support you. Specifically for affiliates:
if you are interested in building relationships with the academic
institutions in your country, we would love to talk with you and learn
more. We have a series of programs that aim to expand the network of
Wikimedia researchers globally and we would love to collaborate with those
of you interested more closely in this space.
- You want to talk with us about one of our existing programs [5].
Hope to see many of you,
Martin on behalf of the WMF Research Team
[1] https://research.wikimedia.org
[2] https://meet.jit.si/WMF-Research-Office-Hours
[3] https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours
[4] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
[5] https://research.wikimedia.org/projects.html
--
Martin Gerlach
Research Scientist
Wikimedia Foundation
Hey all!
Tomorrow, during the Wikimania hackathon, some of us will be hanging out
and working on improving our documentation on public data, dashboards, and
research support. Please join us! Wikimania registration is not required.
WHAT: Clean up, expand, and reorganize Meta-Wiki's documentation on
research, data, and dashboards
WHEN: During the Wikimania Hackathon, Friday, 13 August 05:00 UTC to
Saturday, 14 August 05:00 UTC
WHERE:
https://meet.jit.si/moderated/3741d369509c72904f5247702a8c14a9d2d0b893ea3e3…
(feel free to join without audio and video if you just want to text chat)
More information: https://phabricator.wikimedia.org/T288680
--
Neil Shah-Quinn
senior data scientist, Product Analytics
<https://www.mediawiki.org/wiki/Product_Analytics>
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
Join the Research Team at the Wikimedia Foundation [1] for their monthly
Office hours this Tuesday, 2021-08-03, at 16:00-17:00 UTC (9am PT/6pm
CEST).
To participate, join the video-call via this link [2]. There is no set
agenda - feel free to add your item to the list of topics in the etherpad
[3] (You can do this after you join the meeting, too.), otherwise you are
welcome to also just hang out. More detailed information (e.g. about how to
attend) can be found here [4].
Through these office hours, we aim to make ourselves more available to
answer some of the research related questions that you as Wikimedia
volunteer editors, organizers, affiliates, staff, and researchers face in
your projects and initiatives. Some example cases we hope to be able to
support you in:
- You have a specific research related question that you suspect you
should be able to answer with the publicly available data and you don’t
know how to find an answer for it, or you just need some more help with it.
For example, how can I compute the ratio of anonymous to registered editors
in my wiki?
- You run into repetitive or very manual work as part of your Wikimedia
contributions and you wish to find out if there are ways to use machines to
improve your workflows. These types of conversations can sometimes be
harder to find an answer for during an office hour, however, discussing
them can help us understand your challenges better and we may find ways to
work with each other to support you in addressing it in the future.
- You want to learn what the Research team at the Wikimedia Foundation
does and how we can potentially support you. Specifically for affiliates:
if you are interested in building relationships with the academic
institutions in your country, we would love to talk with you and learn
more. We have a series of programs that aim to expand the network of
Wikimedia researchers globally and we would love to collaborate with those
of you interested more closely in this space.
- You want to talk with us about one of our existing programs [5].
Hope to see many of you,
Martin on behalf of the WMF Research Team
[1] https://research.wikimedia.org
[2] https://meet.jit.si/WMF-Research-Office-Hours
[3] https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours
[4] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
[5] https://research.wikimedia.org/projects.html
--
Martin Gerlach
Research Scientist
Wikimedia Foundation
Hi Z. Blace,
You can watch the recording of this showcase on YouTube [1].
Also, you can find the recordings of previous Research Showcases in this
collection [2].
Best,
Martin
[1] https://www.youtube.com/watch?v=otN3H-hIImQ
[2] https://www.youtube.com/playlist?list=PLhV3K_DS5YfLQLgwU3oDFiGaU3K7pUVoW
On Wed, Jul 21, 2021 at 7:39 PM Željko Blaće <zblace(a)mi2.hr> wrote:
> Overlapping with Art+Feminism session presenting research on almost the
> same topic :-/
>
> Again - calendar synchronization and wikimedia are not at level needed :-(
>
> Best Z. Blace
>
>
> On Wednesday, July 21, 2021, Janna Layton <jlayton(a)wikimedia.org> wrote:
>
> > The Research Showcase will be starting in about 30 minutes.
--
Martin Gerlach
Research Scientist
Wikimedia Foundation
Hello all,
The July Research Showcase will take place on July 21, 16:30 UTC (9:30am
PT/ 12:30pm ET/ 18:30 CEST). The theme is the effects of campaigns to
close content gaps on Wikipedia, and speakers will be Kai Zhu from McGill
University and Isabelle Langrock from the University of Pennsylvania.
Livestream: https://www.youtube.com/watch?v=otN3H-hIImQ
Talk 1
Speaker: Kai Zhu (McGill University, Canada)
Title: Addressing Information Poverty on Wikipedia
Abstract: Open collaboration platforms have fundamentally changed the way
that knowledge is produced, disseminated, and consumed. In these systems,
contributions arise organically with little to no central governance.
Although such decentralization provides many benefits, a lack of broad
oversight and coordination can leave questions of information poverty and
skewness to the mercy of the system’s natural dynamics. Unfortunately, we
still lack a basic understanding of the dynamics at play in these systems
and specifically, how contribution and attention interact and propagate
through information networks. We leverage a large-scale natural experiment
to study how exogenous content contributions to Wikipedia articles affect
the attention that they attract and how that attention spills over to other
articles in the network. Results reveal that exogenously added content
leads to significant, substantial, and long-term increases in both content
consumption and subsequent contributions. Furthermore, we find significant
attention spillover to downstream hyperlinked articles. Through both
analytical estimation and empirically informed simulation, we evaluate
policies to harness this attention contagion to address the problem of
information poverty and skewness. We find that harnessing attention
contagion can lead to as much as a twofold increase in the total attention
flow to clusters of disadvantaged articles. Our findings have important
policy implications for open collaboration platforms and information
networks.
Talk 2
Speaker: Isabelle Langrock (University of Pennsylvania, USA)
Title: Quantifying and Assessing the Impact of Two Feminist Interventions
Abstract: Wikipedia has a well-known gender divide affecting its
biographical content. This bias not only shapes social perceptions of
knowledge, but it can also propagate beyond the platform as its contents
are leveraged to correct misinformation, train machine-learning tools, and
enhance search engine results. What happens when feminist movements
intervene to try to close existing gaps? In this talk, we present a recent
study of two popular feminist interventions designed to counteract digital
knowledge inequality. Our findings show that the interventions are
successful at adding content about women that would otherwise be missing,
but they are less successful at addressing several structural biases that
limit the visibility of women within Wikipedia. We argue for more granular
and cumulative analysis of gender divides in collaborative environments and
identify key areas of support that can further aid the feminist movements
in closing Wikipedia’s gender gaps.
--
Janna Layton (she/her)
Administrative Associate - Product & Technology
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
Join the Research Team at the Wikimedia Foundation [1] for their monthly
Office hours this Tuesday, 2021-07-13, at 16:00-17:00 UTC (9am PT/6pm
CEST).
To participate, join the video-call via this link [2]. There is no set
agenda - feel free to add your item to the list of topics in the etherpad
[3] (You can do this after you join the meeting, too.), otherwise you are
welcome to also just hang out. More detailed information (e.g. about how to
attend) can be found here [4].
Through these office hours, we aim to make ourselves more available to
answer some of the research related questions that you as Wikimedia
volunteer editors, organizers, affiliates, staff, and researchers face in
your projects and initiatives. Some example cases we hope to be able to
support you in:
- You have a specific research related question that you suspect you
should be able to answer with the publicly available data and you don’t
know how to find an answer for it, or you just need some more help with it.
For example, how can I compute the ratio of anonymous to registered editors
in my wiki?
- You run into repetitive or very manual work as part of your Wikimedia
contributions and you wish to find out if there are ways to use machines to
improve your workflows. These types of conversations can sometimes be
harder to find an answer for during an office hour, however, discussing
them can help us understand your challenges better and we may find ways to
work with each other to support you in addressing it in the future.
- You want to learn what the Research team at the Wikimedia Foundation
does and how we can potentially support you. Specifically for affiliates:
if you are interested in building relationships with the academic
institutions in your country, we would love to talk with you and learn
more. We have a series of programs that aim to expand the network of
Wikimedia researchers globally and we would love to collaborate with those
of you interested more closely in this space.
- You want to talk with us about one of our existing programs [5].
Hope to see many of you,
Martin on behalf of the WMF Research Team
[1] https://research.wikimedia.org/team.html
[2] https://meet.jit.si/WMF-Research-Office-Hours
[3] https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours
[4] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
[5] https://research.wikimedia.org/projects.html
--
Martin Gerlach
Research Scientist
Wikimedia Foundation