Hi all,
The next Research Showcase, featuring the recipients of this year's
Wikimedia Foundation Research Awards of the Year, will be live-streamed
Wednesday, July 20, at 9:30 AM PDT/16:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1658334607>.
YouTube stream: https://www.youtube.com/watch?v=KMvXOQU5fX4
You are welcome to ask questions via YouTube chat or on IRC at
#wikimedia-research.
This month's presentations:
Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine
Learning
By *Krishna Srinivasan (Google)*
The milestone improvements brought
about by deep representation learning and pre-training techniques have led
to large performance gains across downstream NLP, IR and Vision tasks.
Multimodal modeling techniques aim to leverage large high-quality
visio-linguistic datasets for learning complementary information across
image and text modalities. In this talk, I introduce the Wikipedia-based
Image Text (WIT) Dataset to better facilitate multimodal, multilingual
learning. WIT is composed of a curated set of 37.5 million entity-rich
image-text examples with 11.5 million unique images across 108 Wikipedia
languages.
WIT has several unique advantages. It is the largest multimodal dataset by
number of image-text examples, by a factor of 3 (at the time of writing). It
is massively multilingual (the first of its kind), covering more than 100
languages. It also represents a more diverse set of concepts and real-world
entities than previous datasets cover.
The WIT dataset is available for download and use via a Creative Commons
license here: https://github.com/google-research-datasets/wit
I conclude the talk with future directions to expand and extend the WIT
dataset. Link to paper: https://arxiv.org/pdf/2103.01913.pdf
Assessing the Quality of Sources in Wikidata Across Languages
By *Gabriel Amaral (King's College London)*
Wikidata is one of the most important
sources of structured data on the web, built by a worldwide community of
volunteers. As a secondary source, its contents must be backed by credible
references; this is particularly important as Wikidata explicitly
encourages editors to add claims for which there is no broad consensus, as
long as they are corroborated by references. Nevertheless, despite this
essential link between content and references, Wikidata’s ability to
systematically assess and assure the quality of its references remains
limited. To this end, we carry out a mixed-methods study to determine the
relevance, ease of access, and authoritativeness of Wikidata references, at
scale and in different languages, using online crowdsourcing, descriptive
statistics, and machine learning. The findings help us ascertain the
quality of references in Wikidata, and identify common challenges in
defining and capturing the quality of user-generated multilingual
structured data on the web. Link to paper:
https://dl.acm.org/doi/abs/10.1145/3484828
You can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Emily, on behalf of the Research team
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation
Hello,
I am one of the test engineers on the QTE team.
There is a plan to migrate the MediaWiki software on production to
Kubernetes.
In preparation for this, we will be migrating test2wiki to Kubernetes
first so that QTE can test it and catch any bugs before the wider
roll-out.
I am trying to identify areas of our software for which the migration to
Kubernetes might pose a risk.
I wonder if this might be true of any of the software you are
responsible for. In particular, I am thinking about where MediaWiki is
interacting with different services in our ecosystem. I don't know
enough about this area to make an informed judgement.
Any ideas about what might be risky and in need of testing, and how one
might go about testing it on test2wiki
(https://test2.wikipedia.org/wiki/Main_Page) would be of great help to
me.
Let me know if you have any questions.
Thank you,
Dom
Hi all,
Join the Research Team at the Wikimedia Foundation [1] for its monthly
office hours on Tuesday, 2022-07-05. Find your local time here
<https://zonestamp.toolforge.org/1657036800>.
To participate, join the video-call via this link [2]. There is no set
agenda - feel free to add your item to the list of topics in the etherpad
[3]. You are welcome to add questions / items to the etherpad in advance,
or when you arrive at the session. Even if you are unable to attend the
session, you can leave a question that we can address asynchronously. If
you do not have a specific agenda item, you are welcome to hang out and
enjoy the conversation. More detailed information (e.g., about how to
attend) can be found here [4].
Through these office hours, we aim to make ourselves available to answer
research-related questions that you as Wikimedia volunteer editors,
organizers, affiliates, staff, and researchers face in your projects and
initiatives. Here are some example cases we hope to be able to support you
with:
- You have a specific research-related question that you suspect you
  should be able to answer with the publicly available data, but you don’t
  know how to find an answer, or you just need some more help with it.
  For example, how can I compute the ratio of anonymous to registered editors
  in my wiki?
- You run into repetitive or very manual work as part of your Wikimedia
  contributions and you wish to find out if there are ways to use machines to
  improve your workflows. These types of questions can sometimes be
  harder to answer during an office hour. However, discussing
  them can help us understand your challenges better, and we may find ways to
  work with each other to support you in addressing them in the future.
- You want to learn what the Research team at the Wikimedia Foundation
  does and how we can potentially support you. Specifically for affiliates:
  if you are interested in building relationships with the academic
  institutions in your country, we would love to talk with you and learn
  more. We have a series of programs that aim to expand the network of
  Wikimedia researchers globally, and we would love to collaborate more
  closely with those of you interested in this space.
- You want to talk with us about one of our existing programs [5].
Hope to see many of you,
Emily, on behalf of the WMF Research Team
[1] https://research.wikimedia.org
[2] https://meet.jit.si/WMF-Research-Office-Hours
[3] https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours
[4] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
[5] https://research.wikimedia.org/projects.html
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation
Hi all,
The next Research Showcase, *Wikipedia's Languages*, will be live-streamed
Wednesday, June 15, at 4:00 AM PDT/11:00 AM UTC. View your local time here
<https://zonestamp.toolforge.org/1655290800>.
YouTube stream: https://www.youtube.com/watch?v=AZQM1dtn3g0
You are welcome to ask questions via YouTube chat or on IRC at
#wikimedia-research.
This month's presentations:
Quantifying knowledge synchronisation in the 21st century
By *Jisung Yoon (Pohang University of Science and Technology)*
Humans acquire and accumulate
knowledge through language usage and eagerly exchange their knowledge for
advancement. Although geographical barriers had previously limited
communication, the emergence of information technology has opened new
avenues for knowledge exchange. However, it is unclear which communication
pathway is dominant in the 21st century. Here, we explore the dominant path
of knowledge diffusion in the 21st century using Wikipedia, the largest
communal dataset. We evaluate the similarity of shared knowledge between
population groups, distinguished based on their language usage. When
population groups are more engaged with each other, their knowledge
structure is more similar, where engagement is indicated by socio-economic
connections, such as cultural, linguistic, and historical features.
Moreover, geographical proximity is no longer a critical requirement for
knowledge dissemination. Furthermore, we integrate our data into a
mechanistic model to better understand the underlying mechanism and suggest
that the knowledge "Silk Road" of the 21st century is based online.
The Language Geography of Wikipedia
By *Martin Dittus*
Every language is a
system of being, doing, knowing, and imagining. With over 7,000 active
languages in the world, how many languages are fully represented online? To
answer this question, digital non-profit Whose Knowledge? initiated the
first ever report on the State of the Internet's Languages. As part of this
report, Martin Dittus and Mark Graham have investigated the languages of
Wikipedia. Wikipedia began with a single English-language edition more than
two decades ago, and now offers more than 300 language editions, which
places it at the forefront of digital language support. However, this does
not mean that speakers of these languages get access to the same content:
Wikipedia’s language editions vary widely in scale. We further find that
this inequality is also reflected in Wikipedia’s geographic coverage: not
all places are captured in every language. Wikipedia's coverage often
follows the global distribution of speakers of the respective language. Yet
even when we account for the distribution of language populations, certain
language communities are much more strongly represented on Wikipedia than
others. As a consequence, we find that for many countries in Africa,
Central and South America, and South Asia, most of the content about those
countries is in a foreign language, often a European-colonial language. In
other words, in many of these places, people may need to be able to speak a
second (possibly foreign) language in order to access Wikipedia information
about their own places. Why do we see these differences? And what can be
done to improve things?
You can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Emily, on behalf of the Research team
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation
Hi all,
Registration for Wiki Workshop 2022 [1] is now open. The event will be
held virtually on April 25, 12:00-18:30 UTC, as part of The Web
Conference 2022 [2]. The plenary parts of the event will be recorded
and shared publicly afterwards.
Wiki Workshop is the largest Wikimedia research event of the year (so
far ;) that the Research team at the Wikimedia Foundation co-organizes
with our Research Fellow, Bob West (EPFL). This year, Srijan Kumar
(Georgia Tech) joined the organizing team as well. :) The event brings
together scholars and researchers from across the world who are
interested in or are actively engaged with research and development on
the Wikimedia projects.
While the details of the schedule are to be finalized and posted in
the coming week, we expect to generally follow the format of 2021 [3].
This year we received research submissions from more than 20 countries
and have accepted 27 research papers whose authors will present their
work as part of the workshop. (If you are an author of an accepted
paper: congrats! :) Our keynote speaker is Larry Lessig [4] and we
will have a panel to reflect on the decade anniversary of SOPA/PIPA,
moderated by Erik Moeller (Freedom of the Press). And of course, all
the music, games, etc. will remain. :)
If you are interested in participating in the live event, please
indicate your interest by filling out [5]. Anyone is encouraged to
register: you don't have to be a researcher. In the registration form,
please explain why attending the live event will support you in your
work on the Wikimedia projects and beyond.
If you have questions, please don't hesitate to reach out.
Best,
Leila
[1] https://wikiworkshop.org/2022/
[2] https://www2022.thewebconf.org/
[3] https://wikiworkshop.org/2021/#schedule
[4] https://hls.harvard.edu/faculty/directory/10519/Lessig
[5] (privacy statement for the Google form survey [6])
https://docs.google.com/forms/d/e/1FAIpQLSctlkUv8FasB2Nc4RvThnxAbjPzUwmnxB2…
[6] https://foundation.wikimedia.org/wiki/Legal:Wiki_Workshop_Registration_Priv…
--
Leila Zia
Head of Research
Wikimedia Foundation
Hi all,
Join the Research Team at the Wikimedia Foundation [1] for its monthly
office hours on Tuesday, 2022-06-07. Find your local time here
<https://zonestamp.toolforge.org/1654642800>.
To participate, join the video-call via this link [2]. There is no set
agenda - feel free to add your item to the list of topics in the etherpad
[3]. You are welcome to add questions / items to the etherpad in advance,
or when you arrive at the session. Even if you are unable to attend the
session, you can leave a question that we can address asynchronously. If
you do not have a specific agenda item, you are welcome to hang out and
enjoy the conversation. More detailed information (e.g., about how to
attend) can be found here [4].
Through these office hours, we aim to make ourselves available to answer
research-related questions that you as Wikimedia volunteer editors,
organizers, affiliates, staff, and researchers face in your projects and
initiatives. Here are some example cases we hope to be able to support you
with:
- You have a specific research-related question that you suspect you
  should be able to answer with the publicly available data, but you don’t
  know how to find an answer, or you just need some more help with it.
  For example, how can I compute the ratio of anonymous to registered editors
  in my wiki?
- You run into repetitive or very manual work as part of your Wikimedia
  contributions and you wish to find out if there are ways to use machines to
  improve your workflows. These types of questions can sometimes be
  harder to answer during an office hour. However, discussing
  them can help us understand your challenges better, and we may find ways to
  work with each other to support you in addressing them in the future.
- You want to learn what the Research team at the Wikimedia Foundation
  does and how we can potentially support you. Specifically for affiliates:
  if you are interested in building relationships with the academic
  institutions in your country, we would love to talk with you and learn
  more. We have a series of programs that aim to expand the network of
  Wikimedia researchers globally, and we would love to collaborate more
  closely with those of you interested in this space.
- You want to talk with us about one of our existing programs [5].
Hope to see many of you,
Emily, on behalf of the WMF Research Team
[1] https://research.wikimedia.org
[2] https://meet.jit.si/WMF-Research-Office-Hours
[3] https://etherpad.wikimedia.org/p/Research-Analytics-Office-hours
[4] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
[5] https://research.wikimedia.org/projects.html
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation
To all observers,
Okay, so I wouldn't even bother with the idea of altering infrastructure;
I'd focus more on substructure in between each branch, even though they
would later have to go through a phishing obviously if the idea is getting
the organization to proceed through the initial intake and have a
filtration system with protocols to ensure nothing is ever able to be
considered stagnant. Especially avoiding the ongoing process resulting in
the stress upon the colliding aforementioned intake process that's built
noticeably compiled attention from different standpoints.
I also haven't the slightest clue what I am in the midst of, but I just
reread it and it sounds like that would overcomplicate the units
pathing/macros/scripting/trigger/action, because I seriously am so sick
I cannot even keep focus, with cold sweats and shivering. I will take my
leave for a little R&R and will be strong and recuperated by next week.
Dear Sir or Madam,
I am writing to you with a question about the Pageviews hourly raw data files
<https://dumps.wikimedia.org/other/pageviews/readme.html>. First of all,
please let me know if I have chosen the right person for this question. If
not, could you please advise to whom I should direct it? The question is below.
I am working on a project where we would like to use Pageviews hourly data
<https://dumps.wikimedia.org/other/pageviews/readme.html>. For us, it is
crucial to get the data as soon as possible. As I can see on the web page,
hourly data is available in Wikimedia's file system approximately 45 min
after the hour ends. But for an end user, it only becomes available several
hours after that (as shown in the screenshot).
Could you help us by answering the following questions:
1. Is there any way to get data as soon as it is available on the
Wikimedia filesystem (~45 min after the hour ends)?
2. Are there any other faster ways to get hourly data? For instance,
faster access to raw data files or access to *wmf.pageview_hourly
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_ho…>*
or to *wmf.pageviews_actor
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_ac…>*.
Unfortunately, the API does not provide data at an hourly level.
Best regards,
Maxim Aparovich
[image: wiki-email.png]
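For readers following the question above, here is a minimal Python sketch
of how one might locate and read an hourly raw pageviews file. It assumes
the directory layout and file naming used on dumps.wikimedia.org
(YYYY/YYYY-MM/pageviews-YYYYMMDD-HH0000.gz) and the space-separated line
format described in the readme (domain code, page title, view count,
response size). The function names are hypothetical, and which hour
boundary the file timestamp refers to should be confirmed against the
readme. It does not address the publication delay asked about.

```python
from datetime import datetime, timezone

# Assumed base location of the hourly raw pageviews files.
BASE_URL = "https://dumps.wikimedia.org/other/pageviews"


def hourly_dump_url(hour: datetime) -> str:
    """Build the URL of the hourly file stamped with the given hour.

    Expects a timezone-aware datetime; it is converted to UTC first.
    (Hypothetical helper, not part of any Wikimedia library.)
    """
    if hour.tzinfo is None:
        raise ValueError("hour must be timezone-aware")
    hour = hour.astimezone(timezone.utc)
    return (
        f"{BASE_URL}/{hour:%Y}/{hour:%Y-%m}/"
        f"pageviews-{hour:%Y%m%d-%H}0000.gz"
    )


def parse_line(line: str) -> tuple[str, str, int]:
    """Parse one line into (domain_code, page_title, view_count).

    Assumes four space-separated fields; the last one (response size)
    is ignored here.
    """
    parts = line.split()
    if len(parts) != 4:
        raise ValueError(f"unexpected line format: {line!r}")
    domain_code, page_title, count_views, _response_size = parts
    return domain_code, page_title, int(count_views)
```

A downloaded file could then be read with gzip.open(path, "rt",
encoding="utf-8") and each line passed to parse_line.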
Hello everyone,
The next Research Showcase, *Gaps and Biases in Wikipedia*, will be
live-streamed Wednesday, May 18, at 9:30 AM PDT/16:30 UTC. View your local
time here <https://zonestamp.toolforge.org/1652891400>.
YouTube stream: https://www.youtube.com/watch?v=Q8FlunZ0mH4
You are welcome to ask questions via YouTube chat or on IRC at
#wikimedia-research.
This month's presentations:
Ms. Categorized: Gender, notability, and inequality on Wikipedia
By Francesca Tripodi (University of North Carolina at Chapel Hill)
For the last five decades, sociologists have argued that gender is one of
the most pervasive and insidious forms of inequality. Research demonstrates
how these inequalities persist on Wikipedia - arguably the largest
encyclopedic reference in existence. Roughly eighty percent of Wikipedia's
editors are men and pages about women and women's interests are
underrepresented. English-language Wikipedia contains more than 1.5 million
biographies about notable writers, inventors, and academics, but fewer than
nineteen percent of these biographies are about women. To try to improve
these statistics, activists host “edit-a-thons” to increase the visibility
of notable women. While this strategy helps create biographies that did not
previously exist, it fails to address a more inconspicuous form of
gender exclusion. Drawing on ethnographic observations, interviews, and
quantitative analysis of web-scraped metadata, this talk demonstrates that
women’s biographies are more frequently considered non-notable and
nominated for deletion compared to men’s biographies. This disproportionate
rate is another dimension of gender inequality on Wikipedia previously
unexplored by social scientists and provides broader insights into how
women’s achievements are (under)valued in society.
Controlled Analyses of Social Biases in Wikipedia Bios
By Yulia Tsvetkov (University of Washington)
Social biases on Wikipedia could greatly influence public opinion.
Wikipedia is also a popular source of training data for NLP models, and
subtle biases in Wikipedia narratives are liable to be amplified in
downstream NLP models. In this talk I'll present two approaches to
unveiling social biases in how people are described on Wikipedia, across
demographic attributes and across languages. First, I'll present a
methodology that isolates dimensions of interest (e.g., gender) from other
attributes (e.g., occupation). This methodology allows us to quantify
systemic differences in coverage of different genders and races, while
controlling for confounding factors. Next, I'll show an NLP case study that
uses this methodology in combination with people-centric sentiment analysis
to identify disparities in Wikipedia bios of members of the LGBTQIA+
community across three languages: English, Russian, and Spanish. Our
results surface cultural differences in narratives and signs of social
biases. Practically, these methods can be used to automatically identify
Wikipedia articles for further manual analysis—articles that might contain
content gaps or an imbalanced representation of particular social groups.
You can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Emily, on behalf of the Research team
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation