Dear members of the Analytics Team,
I am currently conducting research about the excludability of free
knowledge available on the Wikimedia projects as an example of a public
good. In order to calibrate the model, I need aggregate data on the page
views and edits by country and language.
After having carefully read Research:Data
<https://meta.wikimedia.org/wiki/Research:Data>, I was only able to find
data on page views by country and language, which would be enough to
calibrate the demand side of my model. Is it possible to get aggregate
data on edits by country and language, similar to the data on page
views available at WikiStats?
Thanks in advance.
Best regards,
Kiril Simeonovski
Deprecation of Spark v2 scheduled for July 5th
The Data Engineering team is planning to deprecate Spark 2 on July 5th,
2023. Its replacement, Spark 3, is already available and all of our
production data pipelines have been migrated successfully to this new
version
<https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Spark/Migration_to_Sp…>.
We have also assisted in the migration of several other teams’ Spark 2
pipelines to Spark 3, but there may still be other Spark 2 jobs that are
configured in code outside of our control.
We encourage you, therefore, to review any Spark jobs that
you run, to verify that they have been upgraded to work with Spark 3. In
most cases, this will mean checking that the command-line interfaces for
Spark
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Spark#…> use
one of the supported forms, such as spark3-submit or pyspark3. In some
cases this may also mean upgrading your conda environments
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda#Migratin…> on
the stats servers from anaconda-wmf to conda-analytics, if you have not
already done so.
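As a rough aid for the review described above, here is a minimal Python
sketch that scans script text for launcher names that look like Spark 2
commands. The legacy names it searches for (spark2-submit and similar,
pyspark2) are an assumption made for illustration and are not taken from
this announcement; the function names and file suffixes are likewise
hypothetical, so adjust them to match how your own jobs are launched.

```python
import re
from pathlib import Path

# Assumption: legacy Spark 2 launchers are named like "spark2-submit" or
# "pyspark2". These names are illustrative, not confirmed by the announcement.
LEGACY_PATTERN = re.compile(r"\b(spark2-[a-z]+|pyspark2)\b")


def find_legacy_spark_calls(text):
    """Return (line_number, command) pairs for each suspected Spark 2 launcher."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in LEGACY_PATTERN.finditer(line):
            hits.append((number, match.group(1)))
    return hits


def scan_tree(root, suffixes=(".sh", ".py")):
    """Yield (path, line_number, command) for matching files under root.

    The suffix list is a hypothetical default; extend it for crontabs,
    systemd units, or whatever else launches your jobs.
    """
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(errors="ignore")
            for number, command in find_legacy_spark_calls(text):
                yield path, number, command
```

For example, list(scan_tree("/path/to/your/jobs")) would list each file and
line that still mentions one of the assumed legacy names. A clean result
only means these particular names were not found; it does not prove that a
job is compatible with Spark 3.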
The specific change that is scheduled to happen on July 5th is a switch
of the Spark shuffler version used by YARN
<https://phabricator.wikimedia.org/T332765> from 2 to 3. This should
bring significant performance benefits for existing Spark 3 jobs, but it
is more than likely that any Spark 2 jobs attempting to use this new
shuffler will fail.
Please do reach out
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Contact> to the
Data Engineering team if you have any queries or concerns about this
change, or would like help in identifying whether or not you are likely
to be affected.
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
The next Research Showcase, with the theme of *Wikimedia and LGBTQIA+*,
will be live-streamed Wednesday, June 21 at 16:30 UTC. Find your local time
here <https://zonestamp.toolforge.org/1687365012>.
YouTube stream: https://www.youtube.com/watch?v=AOD2ZdxRNfo
You can join the conversation on IRC at #wikimedia-research or on the
YouTube chat.
This month's presentations:
- *Multilingual Contextual Affective Analysis of LGBT People Portrayals
  in Wikipedia*
   - *Speaker*: Chan Park, Carnegie Mellon University
   - *Abstract*: In this talk, I present our research on analyzing the
     portrayal of LGBT individuals in their biographies on Wikipedia, with
     a particular focus on subtle word connotations and cross-cultural
     comparisons. We aim to address two primary research questions: 1) How
     can we effectively measure the nuanced connotations of words in
     multilingual texts, which reflect sentiments, power dynamics, and
     agency? 2) How can we analyze the portrayal of a specific group, such
     as the LGBT community, and compare these portrayals across different
     languages? To answer these questions, we collect the Multilingual
     Contextualized Connotation Frames dataset, comprising 2,700 examples
     in English, Spanish, and Russian. We also develop a new multilingual
     model based on pre-trained multilingual language models.
     Additionally, we devise a matching algorithm to construct a
     comparison corpus for the target corpus, isolating the attribute of
     interest. Finally, we showcase how our developed models and
     constructed corpora enable us to conduct cross-cultural analysis of
     LGBT People Portrayals on Wikipedia. Our results reveal systematic
     differences in how the LGBT community is portrayed across languages,
     surfacing cultural differences in narratives and signs of social
     biases.
   - *Paper:* Park, C. Y., Yan, X., Field, A., & Tsvetkov, Y. (2021,
     May). Multilingual contextual affective analysis of LGBT people
     portrayals in Wikipedia. In Proceedings of the International AAAI
     Conference on Web and Social Media (Vol. 15, pp. 479-490).
     <https://arxiv.org/pdf/2010.10820.pdf>
- *Visual gender biases in Wikipedia: A systematic evaluation across the
  ten most spoken languages*
   - *Speaker*: Daniele Metilli, University College London
   - *Abstract*: Wikidata Gender Diversity (WiGeDi) is a one-year
     project funded through the Wikimedia Research Fund. The project is
     studying gender diversity in Wikidata, focusing on marginalized
     gender identities such as those of trans and non-binary people, and
     adopting a queer and intersectional feminist perspective. The project
     is organised in three strands: model, data, and community. First, we
     are looking at how the current Wikidata ontology model represents
     gender, and the extent to which this representation is inclusive of
     marginalized gender identities. We are analysing the data stored in
     the knowledge base to gather insights and identify possible gaps and
     biases. Finally, we are looking at how the community has handled the
     move towards the inclusion of a wider spectrum of gender identities
     by studying a corpus of user discussions through computational
     linguistics methods. This presentation will report on the current
     status of the Wikidata Gender Diversity project and the envisioned
     outcomes. We will discuss the main challenges that we are facing and
     the opportunities that our project will potentially enable, on
     Wikidata and beyond.
   - *Paper:* Metilli D. & Paolini C. (in press). ‘Non-binary gender
     representation in Wikidata’. In: Provo A., Burlingame K. & Watson
     B.M. Ethics in Linked Data. Litwin Books.
     <https://wigedi.com/chapter.pdf>
You can watch our past Research Showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Hope you can join us!
Warm regards,
--
*Pablo Aragón (he/him)*
Research Scientist
Wikimedia Foundation
https://research.wikimedia.org
Hi all,
It seems like the Wikimedia AQS Pageviews API isn't returning data for
yesterday (2023-06-19). Is there any update on when that data will
be available?
Thanks,
Ben