Hello,
We have to perform some scheduled maintenance of the analytics client
servers <https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients>,
which are named stat100[4-9].
This maintenance requires a reboot of each server, so I'm planning to do
this next Monday the 14th of August at approximately 09:00 UTC.
I'll reboot each of the five servers in numeric sequence and I expect
the work to take no more than 1 hour in total.
Please do let me know if this will adversely impact your work and I will
try my best to work around your requirements.
I'll send another announcement nearer the time, as a reminder to save
any work that you may have in progress on these servers.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello all users of Airflow,
We need to perform some scheduled maintenance on all of our Airflow
instances, so I'm scheduling a maintenance window for tomorrow at 08:30
UTC and I expect the work to take no more than 30 minutes. The work
involves a reboot of the shared PostgreSQL database that serves all of
our instances, as well as a reboot of some instances themselves.
I will pause all active DAGs on all Airflow instances prior to the work,
allow some time for running tasks to complete, then un-pause the DAGs
afterwards.
Naturally, you are also free to pause your own DAGs prior to the
maintenance and un-pause them afterwards, should you wish to minimize
the risk of disruption.
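For anyone who would rather script this than use the web interface, here is a minimal sketch. It assumes the Airflow 2 command-line interface (`airflow dags pause` and `airflow dags unpause`) is available on the host where your instance runs. The DAG IDs and the helper name are hypothetical placeholders, and the function only builds the commands so that they can be reviewed before anything is run.

```python
from typing import List


def build_pause_commands(dag_ids: List[str], pause: bool = True) -> List[List[str]]:
    """Build Airflow 2 CLI commands to pause or un-pause the given DAGs.

    The commands are returned rather than executed, so they can be
    checked first and then passed to subprocess.run() on the host
    where the relevant Airflow instance lives.
    """
    action = "pause" if pause else "unpause"
    return [["airflow", "dags", action, dag_id] for dag_id in dag_ids]


# Hypothetical DAG IDs, for illustration only.
my_dags = ["example_daily_report", "example_hourly_load"]

for command in build_pause_commands(my_dags, pause=True):
    print(" ".join(command))
```

Calling the same helper with `pause=False` after the maintenance produces the matching un-pause commands.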
Please do let me know if there is anything specific that you would like
me to check, either before or after this maintenance.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Dear members of the Analytics Team,
I am currently conducting research about the excludability of free
knowledge available on the Wikimedia projects as an example of a public
good. In order to calibrate the model, I need aggregate data on the page
views and edits by country and language.
After having carefully read Research:Data
<https://meta.wikimedia.org/wiki/Research:Data>, I was only able to find
data on page views by country and language, which would be enough to
calibrate the demand side of my model. So, is it possible to get aggregate
data on edits by country and language, which are similar to those on page
views available at WikiStats?
Thanks in advance.
Best regards,
Kiril Simeonovski
Hello everyone,
The next Research Showcase, focused on *Improving knowledge integrity in
Wikimedia projects*, will be live-streamed Wednesday, July 19, at 9:30 AM
PDT / 16:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1689784256>.
The event is on the WMF Staff Calendar.
YouTube stream: https://youtube.com/live/_8DevIsi44s?feature=share
You can join the conversation on IRC at #wikimedia-research. You can also
watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
This month's presentations:
Assessment of Reference Quality on Wikipedia
By *Aitolkyn Baigutanova, KAIST*
In this talk, I will present our research on the reliability of Wikipedia
through the lens of its references. I will primarily discuss our paper on
through the lens of its references. I will primarily discuss our paper on
the longitudinal assessment of reference quality on English Wikipedia,
where we operationalize the notion of reference quality by defining
reference need (RN), i.e., the percentage of sentences missing a citation,
and reference risk (RR), i.e., the proportion of non-authoritative
references. I will share our research findings on two key aspects: (1) the
evolution of reference quality over a 10-year period and (2) factors that
affect reference quality. We discover that the RN score has dropped by 20
percentage points, with more than half of verifiable statements now
accompanied by references. The RR score has remained below 1% over the years
as a result of the efforts of the community to eliminate unreliable
references. As an extension of this work, we explore how community
initiatives, such as the perennial source list, help with maintaining
reference quality across multiple language editions of Wikipedia. We hope
our work encourages more active discussions within Wikipedia communities to
improve reference quality of the content.
- Paper: Aitolkyn Baigutanova, Jaehyeon Myung, Diego Saez-Trumper,
Ai-Jou Chou, Miriam Redi, Changwook Jung, and Meeyoung Cha. 2023.
Longitudinal Assessment of Reference Quality on Wikipedia. In Proceedings
of the ACM Web Conference 2023 (WWW '23). Association for Computing
Machinery, New York, NY, USA, 2831–2839.
<https://dl.acm.org/doi/abs/10.1145/3543507.3583218>
Multilingual approaches to support knowledge integrity in Wikipedia
By *Diego Saez-Trumper & Pablo Aragón, Wikimedia Foundation*
Knowledge integrity in Wikipedia is key to ensuring the quality and
reliability of information. For that reason, editors devote a substantial
amount of their time to patrolling tasks in order to detect low-quality or
misleading content. In
this talk we will cover recent multilingual approaches to support knowledge
integrity. First, we will present a novel design of a system aimed at
assisting the Wikipedia communities in addressing vandalism. This system
was built by collecting a massive dataset of multiple languages and then
applying advanced filtering and feature engineering techniques, including
multilingual masked language modeling to build the training dataset from
human-generated data. Second, we will showcase the Wikipedia Knowledge
Integrity Risk Observatory, a dashboard that relies on a language-agnostic
version of the former system to monitor high risk content in hundreds of
Wikipedia language editions. We will conclude with a discussion of
different challenges to be addressed in future work.
- Papers:
Trokhymovych, M., Aslam, M., Chou, A. J., Baeza-Yates, R., & Saez-Trumper,
D. (2023). Fair multilingual vandalism detection system for Wikipedia.
arXiv e-prints, arXiv-2306. https://arxiv.org/pdf/2306.01650.pdf
Aragón, P., & Sáez-Trumper, D. (2021). A preliminary approach to knowledge integrity
risk assessment in Wikipedia projects. arXiv preprint arXiv:2106.15940.
Best,
Kinneret
--
Kinneret Gordon
Senior Research Community Officer
Wikimedia Foundation <https://wikimediafoundation.org/>
Deprecation of Spark v2 scheduled for July 5th
The Data Engineering team is planning to deprecate Spark 2 on July 5th
2023. Its replacement, Spark 3, is already available and all of our
production data pipelines have been migrated successfully to this new
version
<https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Spark/Migration_to_Sp…>.
We have also assisted in the migration of several other teams’ Spark 2
pipelines to Spark 3, but there may still be other Spark 2 jobs that are
configured in code outside of our control.
We encourage you, therefore, to review any Spark jobs that
you run, to verify that they have been upgraded to work with Spark 3. In
most cases, this will mean checking that the command-line interfaces for
spark
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Spark#…> use
one of the supported forms, such as spark3-submit or pyspark3. In some
cases this may also mean upgrading your conda environments
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda#Migratin…> on
the stats servers from anaconda-wmf to conda-analytics, if you have not
already done so.
The specific change that is scheduled to happen on July 5th is a switch
of the Spark shuffler version used by YARN
<https://phabricator.wikimedia.org/T332765> from 2 to 3. This should
bring significant performance benefits for existing spark3 jobs, but it
is more than likely that any spark2 jobs attempting to use this new
shuffler will fail.
Please do reach out
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Contact> to the
Data Engineering team if you have any queries or concerns about this
change, or would like help in identifying whether or not you are likely
to be affected.
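As a rough first pass at identifying affected jobs, something like the sketch below could be run over your own launch scripts. It is not an official tool. It assumes that Spark 2 entry points are recognisable by a "2" in the command name (for example `spark2-submit` or `pyspark2`, which are assumed names rather than a complete list), and the function name is hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: Spark 2 commands look like "spark2", "spark2-<something>"
# or "pyspark2". Adjust the pattern to match your own tooling.
LEGACY_SPARK_PATTERN = re.compile(r"\b(?:spark2(?:-[a-z]+)?|pyspark2)\b")


def find_spark2_usages(script_text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs that appear to invoke Spark 2."""
    usages = []
    for line_number, line in enumerate(script_text.splitlines(), start=1):
        if LEGACY_SPARK_PATTERN.search(line):
            usages.append((line_number, line.strip()))
    return usages


# Made-up script contents, for illustration only.
sample_script = """#!/bin/bash
spark2-submit --master yarn job.py
spark3-submit --master yarn job.py
pyspark2 --master yarn
"""

for line_number, line in find_spark2_usages(sample_script):
    print(f"line {line_number}: {line}")
```

A text search of this kind will miss jobs that are configured elsewhere (for example in scheduler definitions), so an empty result is not proof that nothing is affected.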
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
The next Research Showcase, with the theme of *Wikimedia and LGBTQIA+*,
will be live-streamed Wednesday, June 21 at 16:30 UTC. Find your local time
here <https://zonestamp.toolforge.org/1687365012>.
YouTube stream: https://www.youtube.com/watch?v=AOD2ZdxRNfo
You can join the conversation on IRC at #wikimedia-research or on the
YouTube chat.
This month's presentations:
- *Multilingual Contextual Affective Analysis of LGBT People Portrayals
in Wikipedia*
- *Speaker*: Chan Park, Carnegie Mellon University
- *Abstract*: In this talk, I present our research on analyzing the
portrayal of LGBT individuals in their biographies on Wikipedia, with a
particular focus on subtle word connotations and cross-cultural
comparisons. We aim to address two primary research questions: 1) How can
we effectively measure the nuanced connotations of words in multilingual
texts, which reflect sentiments, power dynamics, and agency? 2) How can we
analyze the portrayal of a specific group, such as the LGBT community, and
compare these portrayals across different languages? To answer these
questions, we collect the Multilingual Contextualized Connotation Frames
dataset, comprising 2,700 examples in English, Spanish, and Russian. We
also develop a new multilingual model based on pre-trained multilingual
language models. Additionally, we devise a matching algorithm to construct
a comparison corpus for the target corpus, isolating the attribute of
interest. Finally, we showcase how our developed models and constructed
corpora enable us to conduct cross-cultural analysis of LGBT People
Portrayals on Wikipedia. Our results reveal systematic differences in how
the LGBT community is portrayed across languages, surfacing cultural
differences in narratives and signs of social biases.
- *Paper:* Park, C. Y., Yan, X., Field, A., & Tsvetkov, Y. (2021,
May). Multilingual contextual affective analysis of LGBT people portrayals
in Wikipedia. In Proceedings of the International AAAI Conference on Web
and Social Media (Vol. 15, pp. 479-490).
<https://arxiv.org/pdf/2010.10820.pdf>
- *Visual gender biases in Wikipedia: A systematic evaluation across the
ten most spoken languages*
- *Speaker*: Daniele Metilli, University College London
- *Abstract*: Wikidata Gender Diversity (WiGeDi) is a one-year
project funded through the Wikimedia Research Fund. The project is studying
gender diversity in Wikidata, focusing on marginalized gender identities
such as those of trans and non-binary people, and adopting a queer and
intersectional feminist perspective. The project is organised in three
strands — model, data, and community. First, we are looking at how the
current Wikidata ontology model represents gender, and the extent to which
this representation is inclusive of marginalized gender identities. We are
analysing the data stored in the knowledge base to gather insights and
identify possible gaps and biases. Finally, we are looking at how the
community has handled the move towards the inclusion of a wider spectrum of
gender identities by studying a corpus of user discussions through
computational linguistics methods. This presentation will report on the
current status of the Wikidata Gender Diversity project and the envisioned
outcomes. We will discuss the main challenges that we are facing and the
opportunities that our project will potentially enable, on Wikidata and
beyond.
- *Paper:* Metilli D. & Paolini C. (in press). ‘Non-binary gender
representation in Wikidata’. In: Provo A., Burlingame K. & Watson B.M.
Ethics in Linked Data. Litwin Books. <https://wigedi.com/chapter.pdf>
You can watch our past Research Showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Hope you can join us!
Warm regards,
--
*Pablo Aragón (he/him)*
Research Scientist
Wikimedia Foundation
https://research.wikimedia.org
Hi all,
It seems like the Wikimedia AQS Pageviews API isn't returning data for
yesterday (2023-06-19). Is there any update on when that data will
be available?
Thanks,
Ben
Hello,
We need to schedule a reboot of the servers that provide copies of the
MediaWiki databases for analytics purposes.
https://wikitech.wikimedia.org/wiki/Analytics/Systems/MariaDB
These are the servers: dbstore1003, dbstore1005, and dbstore1007.
I'm intending to carry out this work at 09:30 UTC next Tuesday the 9th
of May. I will restart all three servers in succession, so I expect the
maintenance to be complete within approximately 30 minutes.
Please note that the Wiki Replica databases are not affected by this
maintenance: https://wikitech.wikimedia.org/wiki/Wiki_Replicas
Please do let me know if you have any queries or if this choice of
maintenance window is likely to cause you any inconvenience.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
The next Research Showcase, with the theme of Images on Wikipedia, will be
live-streamed Wednesday, April 19, at 16:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1681921857>.
YouTube stream: https://www.youtube.com/watch?v=vW0waU-QArU
You can join the conversation on IRC at #wikimedia-research or on the
YouTube chat.
This month's presentations:
A large scale study of reader interactions with images on Wikipedia
By *Daniele Rama, University of Turin*
Wikipedia is the largest source of free
encyclopedic knowledge and one of the most visited sites on the Web. To
increase reader understanding of the article, Wikipedia editors add images
within the text of the article’s body. However, despite their widespread
usage on web platforms and the huge volume of visual content on Wikipedia,
little is known about the importance of images in the context of free
knowledge environments. To bridge this gap, we collect data about English
Wikipedia reader interactions with images during one month and perform the
first large-scale analysis of how interactions with images happen on
Wikipedia. First, we quantify the overall engagement with images, finding
that one in 29 pageviews results in a click on at least one image, one
order of magnitude higher than interactions with other types of article
content. Second, we study what factors associate with image engagement and
observe that clicks on images occur more often in shorter articles and
articles about visual arts or transport and biographies of less well-known
people. Third, we look at interactions with Wikipedia article previews and
find that images help support readers' information needs when navigating
through the site, especially for more popular pages. The findings in this
study deepen our understanding of the role of images for free knowledge and
provide a guide for Wikipedia editors and web user communities to enrich
the world’s largest source of encyclopedic knowledge.
- Paper:
https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-021-0…
Visual gender biases in Wikipedia: A systematic evaluation across the ten
most spoken languages
By *Pablo Beytia, Catholic University of Chile*
The existing research suggests a significant gender gap in Wikipedia
biographical articles, with a minimal representation of women and gender
asymmetries in the textual content. However, the visual aspects of this gap
(e.g., image volume and quality) have received little attention. This study
examined asymmetries between women's and men's biographies, exploring
written and visual content across the ten most widely spoken languages. The
cross-lingual analysis reveals that (1) the most salient male biases appear
when editors select which personalities should have a Wikipedia page, (2)
the trends in written and visual content are dissimilar, (3) male
biographies tend to have more images across languages, and (4) female
biographies have better visual quality on average. The open database of
this study provides eight indicators of gender asymmetries in ten
occupational domains and ten languages. That information allows for a
granular view of gender biases, as well as exploring more macroscopic
phenomena, such as the similarity between Wikipedia versions according to
their gender bias structures.
- Papers:
Beytía, P., Agarwal, P., Redi, M., & Singh, V. K. (2022). Visual Gender
Biases in Wikipedia: A Systematic Evaluation across the Ten Most Spoken
Languages. Proceedings of the International AAAI Conference on Web and
Social Media, 16(1), 43-54. https://doi.org/10.1609/icwsm.v16i1.19271
https://ojs.aaai.org/index.php/ICWSM/article/view/19271
Beytía, P. & Wagner, C. (2022). Visibility layers: a framework for
systematizing the gender gap in Wikipedia content. Internet Policy Review,
11(1). https://doi.org/10.14763/2022.1.1621
https://policyreview.info/articles/analysis/visibility-layers-framework-sys…
You can watch our past Research Showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Hope you can join us!
Warm regards,
Emily
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation
Hello,
Apologies for the short notice. The SRE team will be carrying out an
upgrade of the switches in eqiad row D later today
(https://phabricator.wikimedia.org/T333377) at approximately 14:00 UTC.
The network outage to this row resulting from this work is expected to
be around 30 minutes, all being well.
In support of this work, the Data Engineering team will be putting the HDFS
file system into safe mode at approximately 13:30 today, which means
that write operations to the cluster will be refused.
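If you have a job that should check before attempting to write, a minimal sketch is shown below. It assumes that the standard `hdfs dfsadmin -safemode get` command is available to your user on the host in question and that its output contains the text "Safe mode is ON" or "Safe mode is OFF". The helper names are hypothetical.

```python
import subprocess


def parse_safemode_output(output: str) -> bool:
    """Return True if the command output reports that safe mode is on."""
    return "safe mode is on" in output.lower()


def hdfs_in_safe_mode() -> bool:
    """Ask HDFS for its safe mode status.

    Assumes the `hdfs` client is installed and that the current user is
    already authenticated to the cluster.
    """
    result = subprocess.run(
        ["hdfs", "dfsadmin", "-safemode", "get"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_safemode_output(result.stdout)


# The parsing step can be exercised without a cluster:
print(parse_safemode_output("Safe mode is ON"))
print(parse_safemode_output("Safe mode is OFF"))
```

A job could call `hdfs_in_safe_mode()` and postpone its write step while the result is true.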
Jobs sent to the YARN cluster will also be refused from around the same
time, so please try to plan any work that you may have for the cluster
to avoid this maintenance window.
Read-only access to Hive, Presto, Superset, and Turnilo should continue to
function normally throughout the maintenance window.
Finally, two of the stats servers (stat1005 and stat1006) will be
unavailable, so please save any work that you may have on these servers
before the loss of connectivity.
Please do reach out via any of the normal channels (email:
analytics(a)lists.wikimedia.org, IRC: #wikimedia-analytics, Slack:
#data-engineering) if you have any queries or concerns.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>