Registration for Wiki Workshop 2022 is now open. The event will be
held virtually on April 25, 12:00-18:30 UTC, as part of The Web
Conference 2022. The plenary parts of the event will be recorded
and shared publicly afterwards.
Wiki Workshop is the largest Wikimedia research event of the year (so
far;) that the Research team at the Wikimedia Foundation co-organizes
with our Research Fellow, Bob West (EPFL). This year, Srijan Kumar
(Georgia Tech) joined the organizing team as well. :) The event brings
together scholars and researchers from across the world who are
interested in or are actively engaged with research and development on
the Wikimedia projects.
While the details of the schedule are to be finalized and posted in
the coming week, we expect to generally follow the format of 2021.
This year we received research submissions from more than 20 countries
and have accepted 27 research papers whose authors will present the
work as part of the workshop. If you are an author of an accepted
paper: congrats! :) Our keynote speaker is Larry Lessig, and we
will have a panel to reflect on the decade anniversary of SOPA/PIPA,
moderated by Erik Moeller (Freedom of the Press Foundation). And of course, all
the music, games, etc. will remain. :)
If you are interested in participating in the live event, please
indicate your interest by filling out the registration form. Anyone is encouraged to
register: you don't have to be a researcher. In the registration form,
please explain why attending the live event will support you in your
work on the Wikimedia projects and beyond.
If you have questions, please don't hesitate to reach out.
(privacy statement for the Google form survey)
Head of Research
Dear Sir or Madam,
I am writing to you with a question about the Pageviews hourly raw data files
<https://dumps.wikimedia.org/other/pageviews/readme.html>. First of all,
please let me know whether I have chosen the right person for this question.
If not, could you please advise to whom I should direct it? The question is below.
I am working on a project where we would like to use the Pageviews hourly data
<https://dumps.wikimedia.org/other/pageviews/readme.html>. For us, it is
crucial to get the data as soon as possible. As I can see on the web page,
hourly data is available in Wikimedia's file system approximately 45 minutes
after the hour ends. For an end user, however, it only becomes available
several hours after that (this is shown on the screenshot).
Could you help us by answering the following questions:
1. Is there any way to get data as soon as it is available on the
Wikimedia filesystem (~45 min after the hour ends)?
2. Are there any other, faster ways to get hourly data? For instance,
faster access to the raw data files or access to *wmf.pageview_hourly*? The
API does not provide the opportunity to get data at an hourly level.
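
For anyone who wants to measure the publication delay described above, here is a minimal Python sketch. It is not an official tool. The file naming pattern (`pageviews-YYYYMMDD-HH0000.gz` under `/YYYY/YYYY-MM/`) is an assumption based on how the dumps directory appears to be laid out, and should be checked against the readme, as should the question of which hour a given label covers. The `USER_AGENT` string and all function names are hypothetical placeholders.

```python
import urllib.error
import urllib.request
from datetime import datetime, timedelta, timezone

# Base URL taken from the readme link above; the path layout below is assumed.
BASE_URL = "https://dumps.wikimedia.org/other/pageviews"
# Placeholder User-Agent; replace with something that identifies your project.
USER_AGENT = "pageviews-delay-check-example/0.1 (contact: you@example.org)"


def hourly_file_url(hour_label: datetime) -> str:
    """Build the assumed URL of the hourly file labelled with this UTC hour."""
    if hour_label.tzinfo is None:
        raise ValueError("hour_label must be timezone-aware")
    h = hour_label.astimezone(timezone.utc)
    return f"{BASE_URL}/{h:%Y}/{h:%Y-%m}/pageviews-{h:%Y%m%d-%H}0000.gz"


def earliest_expected(hour_end: datetime, delay_minutes: int = 45) -> datetime:
    """Earliest time the file could exist, using the ~45 min figure quoted above."""
    return hour_end + timedelta(minutes=delay_minutes)


def is_published(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and report whether the file is publicly reachable."""
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": USER_AGENT}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        # Treated here as "not published yet" (typically a 404).
        return False
```

Calling `is_published(hourly_file_url(...))` at intervals after `earliest_expected(...)` would show how long the public copy actually lags; it does not make the data available any sooner.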
The next Research Showcase, *Gaps and Biases in Wikipedia*, will be
live-streamed Wednesday, May 18, at 9:30 AM PDT/16:30 UTC. View your local
time here <https://zonestamp.toolforge.org/1652891400>.
YouTube stream: https://www.youtube.com/watch?v=Q8FlunZ0mH4
You are welcome to ask questions via YouTube chat or on IRC at
This month's presentations:
Ms. Categorized: Gender, notability, and inequality on Wikipedia
By Francesca Tripodi (University of North Carolina at Chapel Hill)
For the last five decades, sociologists have argued that gender is one of
the most pervasive and insidious forms of inequality. Research demonstrates
how these inequalities persist on Wikipedia - arguably the largest
encyclopedic reference in existence. Roughly eighty percent of Wikipedia's
editors are men and pages about women and women's interests are
underrepresented. English language Wikipedia contains more than 1.5 million
biographies about notable writers, inventors, and academics, but less than
nineteen percent of these biographies are about women. To try and improve
these statistics, activists host “edit-a-thons” to increase the visibility
of notable women. While this strategy helps create several biographies
previously inexistent, it fails to address a more inconspicuous form of
gender exclusion. Drawing on ethnographic observations, interviews, and
quantitative analysis of web-scraped metadata, this talk demonstrates that
women’s biographies are more frequently considered non-notable and
nominated for deletion compared to men’s biographies. This disproportionate
rate is another dimension of gender inequality on Wikipedia previously
unexplored by social scientists and provides broader insights into how
women’s achievements are (under)valued in society.
Controlled Analyses of Social Biases in Wikipedia Bios
By Yulia Tsvetkov (University of Washington)
Social biases on Wikipedia could greatly influence public opinion.
Wikipedia is also a popular source of training data for NLP models, and
subtle biases in Wikipedia narratives are liable to be amplified in
downstream NLP models. In this talk I'll present two approaches to
unveiling social biases in how people are described on Wikipedia, across
demographic attributes and across languages. First, I'll present a
methodology that isolates dimensions of interest (e.g., gender), from other
attributes (e.g., occupation). This methodology allows us to quantify
systemic differences in coverage of different genders and races, while
controlling for confounding factors. Next, I'll show an NLP case study that
uses this methodology in combination with people-centric sentiment analysis
to identify disparities in Wikipedia bios of members of the LGBTQIA+
community across three languages: English, Russian, and Spanish. Our
results surface cultural differences in narratives and signs of social
biases. Practically, these methods can be used to automatically identify
Wikipedia articles for further manual analysis—articles that might contain
content gaps or an imbalanced representation of particular social groups.
You can also watch our past research showcases here:
Emily, on behalf of the Research team
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation
We use the Wikimedia AQS Pageviews REST API (Analytics/AQS/Pageviews).
When requesting pageview counts by article, we have noticed that data
for the latest day does not become available for all pages at the
same time. Some pages appear to be updated later than others. Is there
a place we can check (e.g. a status page or dump files) to determine
whether all pageview data for the latest day is accessible via the AQS
Pageviews REST API?
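
To make the behaviour concrete, here is a minimal Python sketch of the per-article check we mean. It only tells you whether one article has an entry for a given day, not whether loading has finished for all pages. The helper names and the `USER_AGENT` string are hypothetical, and treating an HTTP 404 as "no data yet" is our assumption about how the API signals a missing range.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
# Placeholder User-Agent; replace with something that identifies your project.
USER_AGENT = "aqs-latest-day-check-example/0.1 (contact: you@example.org)"


def per_article_url(project, article, start, end,
                    access="all-access", agent="all-agents",
                    granularity="daily"):
    """Build a per-article pageviews request URL (dates as YYYYMMDD)."""
    title = urllib.parse.quote(article.replace(" ", "_"), safe="")
    return (f"{API_ROOT}/{project}/{access}/{agent}/"
            f"{title}/{granularity}/{start}/{end}")


def has_day(response, day):
    """True if the response has an item whose timestamp starts with `day`."""
    return any(item.get("timestamp", "").startswith(day)
               for item in response.get("items", []))


def fetch(url, timeout=10.0):
    """Fetch and decode the JSON response; a 404 becomes an empty item list."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # Assumption: interpreted here as "no data for this range yet".
            return {"items": []}
        raise
```

For example, `has_day(fetch(per_article_url("en.wikipedia", "Albert Einstein", "20220517", "20220518")), "20220518")` would report whether that one article already has an entry for May 18. Running the same check over a sample of pages is how we noticed the staggered updates.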