Hello,
You can ignore this email unless you use any of the Airflow
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow>
instances
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow/Instan…>
managed by the Data Platform Engineering team.
Following the recent Airflow upgrade to version 2.7.3, we discovered a
regression that affects all of our instances. It's a small bug
<https://github.com/apache/airflow/issues/36206> but it means that since
the upgrade users have been unable to add notes to any tasks. See:
T352534 <https://phabricator.wikimedia.org/T352534> for more detail.
In the short term, we have decided to implement a workaround, which is
to create an *admin:admin* user for each Airflow instance. You can use
this to log in if you wish to manage notes associated with your DAG run
tasks. The login link is at the top-right of the Airflow UI.
As the ability to access each Airflow instance is currently limited to
those with SSH access to the host, this change is not granting anyone
any additional rights that they do not already have. It's merely an
inconvenience, for which we apologise.
We have several longer-term options in mind, but I won't go into them here.
I have made a note of this configuration detail here, in case you would
like to refer to it again:
https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow#Authen…
Naturally, please feel free to get in touch if you have any queries or
concerns about this.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello all users of Airflow,
We need to upgrade Airflow on all of our Airflow instances, so I'm
scheduling a maintenance window for tomorrow, Wednesday November 29th at
10:30 UTC and I expect the work to take no more than 30 minutes.
I will pause all active DAGs on all Airflow instances prior to the work,
allow some time for running tasks to complete, then resume the DAGs
afterwards.
Naturally, you are also free to pause your own DAGs prior to the
maintenance and resume them afterwards, should you wish to minimize the
risk of disruption.
Please do let me know if there is anything specific that you would like
me to check, either before or after this maintenance.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
The next Research Showcase will be live-streamed on Wednesday, November 15,
at 9:30 AM PST / 17:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1700069400>. This showcase will focus on
*Bibliometrics*, just in time for the GLAM Wiki conference happening this
week in Montevideo.
YouTube stream: https://www.youtube.com/watch?v=IxNa6vgMCDY. As usual, you
can join the conversation in the YouTube chat as soon as the showcase goes
live.
This month's presentations:
Gender and country biases in Wikipedia citations to scholarly publications
By *Chaoqun Ni, University of Wisconsin-Madison*
Ensuring Wikipedia cites
scholarly publications based on quality and relevancy without biases is
critical to credible and fair knowledge dissemination. We investigate
gender- and country-based biases in Wikipedia citation practices using
linked data from the Web of Science and a Wikipedia citation dataset. Using
coarsened exact matching, we show that publications by women are cited less
by Wikipedia than expected, and publications by women are less likely to be
cited than those by men. Scholarly publications by authors affiliated with
non-Anglosphere countries are also disadvantaged in getting cited by
Wikipedia, compared with those by authors affiliated with Anglosphere
countries. The level of gender- or country-based inequalities varies by
research field, and the gender-country intersectional bias is prominent in
math-intensive STEM fields. To ensure the credibility and equality of
knowledge presentation, Wikipedia should consider strategies and guidelines
to cite scholarly publications independent of the gender and country of
authors.
Exploring Social Attention Dynamics through Wikipedia
By *Wenceslao Arroyo-Machado, Universidad de Granada*
The untapped potential of Wikipedia
as a mirror of society's evolving interests and concerns is explored.
Recognizing Wikipedia as a vast, interactive repository of human knowledge,
the investigation focuses on how patterns of edits, views, and discussions
within Wikipedia articles, as well as their features, can serve as
real-time indicators of public interest and engagement. Key findings reveal
that Wikipedia is not just an information source but a reflection of
collective concerns, capturing significant trends and shifts in societal
focus. Additionally, it allows for the highlighting of both local and
international interests. These implications are far-reaching, offering
valuable insights for the Wikipedia community, academic researchers,
policymakers, and the general public. Understanding the dynamics of public
engagement on Wikipedia can inform content strategies, shape research
agendas, and guide public policy, while also providing a deeper
appreciation of the impact and significance of contributions made by the
global Wikipedia community.
You can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Best,
Kinneret
--
Kinneret Gordon
Lead Research Community Officer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello,
We have to carry out some scheduled maintenance that will require a
brief period of disruption for our analytics_meta
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Analytics_Meta>
MariaDB service, whilst it is moved to a new primary host
<https://phabricator.wikimedia.org/T284150>. This will affect Hive, both
Druid clusters, Superset, Hue, and DataHub.
I plan to do this work tomorrow morning, starting shortly after 11:00
UTC and I expect the change to take no more than around 20 minutes,
during which time you might find that the above services are disrupted.
Our production pipelines that write to HDFS and Hive will be paused while the
work is being carried out.
I will likely also put HDFS briefly into Safe Mode, which prevents write
access, whilst I reconfigure and restart Hive.
If you could plan to work your own tasks and pipelines around this
maintenance window, I would be grateful. Please do get in touch if you
have any questions, or if this maintenance plan will cause you any specific
inconvenience. If you think that you will be adversely affected I can
see whether it is possible to reschedule or find another workaround for you.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
The next Research Showcase, focused on *Data Privacy*, will be
live-streamed on Wednesday, October 18, at 9:30 AM PDT / 16:30 UTC. Find
your local time here <https://zonestamp.toolforge.org/1697646641>.
YouTube stream: https://www.youtube.com/watch?v=ntgRsMaDlsw. As usual, you
can join the conversation in the YouTube chat as soon as the showcase goes
live.
This month's presentations:
Wikipedia Reader Navigation: When Synthetic Data Is Enough
By *Akhil Arora, EPFL*
Every day millions of people read Wikipedia. When navigating the vast
space of available topics using hyperlinks, readers describe trajectories
on the article network. Understanding these navigation patterns is crucial
to better serve readers’ needs and address structural biases and knowledge
gaps. However, systematic studies of navigation on Wikipedia are hindered
by a lack of publicly available data due to the commitment to protect
readers' privacy by not storing or sharing potentially sensitive data. In
this paper, we ask: How well can Wikipedia readers' navigation be
approximated by using publicly available resources, most notably the
Wikipedia clickstream data <https://wikinav.toolforge.org/>? We
systematically quantify the differences between real navigation sequences
and synthetic sequences generated from the clickstream data, in 6 analyses
across 8 Wikipedia language versions. Overall, we find that the differences
between real and synthetic sequences are statistically significant, but
with small effect sizes, often well below 10%. This constitutes
quantitative evidence for the utility of the Wikipedia clickstream data as
a public resource: clickstream data can closely capture reader navigation
on Wikipedia and provides a sufficient approximation for most practical
downstream applications relying on reader data. More broadly, this study
provides an example for how clickstream-like data can generally enable
research on user navigation on online platforms while protecting users’
privacy.
How to tell the world about data you cannot show them: Differential privacy
at the Wikimedia Foundation
By *Hal Triedman, Wikimedia Foundation*
The Wikimedia Foundation (WMF), by virtue of its centrality on the internet,
collects lots of data about platform activities. Some of that data is made
public (e.g. global daily pageviews); other data types are not shared (or
are pseudonymized prior to sharing), largely due to privacy concerns.
Differential privacy is a statistical definition of privacy that has gained
prominence in academia, but is still an emerging technology in industry. In
this talk, I share the story of how we put differential privacy into
production at the WMF, through looking at the case study of geolocated
daily pageview counts.
You can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Best,
Kinneret
--
Kinneret Gordon
Lead Research Community Officer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello,
We have to carry out a scheduled reboot of several of the analytics
client servers
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients>.
These servers are: stat100[4,6,7,9]
I'm planning to do this next Wednesday the 18th of October at
approximately 09:00 UTC.
Please do let me know if this will adversely impact your work and I will
try my best to work around your requirements.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello all users of Airflow,
We need to perform some scheduled maintenance on all of our Airflow
instances, so I'm scheduling a maintenance window for Tuesday October
17th at 09:00 UTC and I expect the work to take no more than 30 minutes.
I will pause all active DAGs on all Airflow instances prior to the work,
allow some time for running tasks to complete, then resume the DAGs
afterwards. Naturally, you are also free to pause your own DAGs prior to
the maintenance and resume them afterwards, should you wish to minimize
the risk of disruption.
Please do let me know if there is anything specific that you would like
me to check, either before or after this maintenance.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
The next Research Showcase, focused on *Rules on Wikipedia*, will be
live-streamed on Wednesday, September 20, at 9:30 AM PDT / 16:30 UTC. Find
your local time here <https://zonestamp.toolforge.org/1695227400>.
YouTube stream: https://youtube.com/live/h89l9JWZBCU?feature=share.
As usual, you can join the conversation in the YouTube chat as soon as the
showcase goes live.
This month's presentations:
Variation and overlap in the peer production of community rules: the case
of five Wikipedias
By *Sohyeon Hwang, Northwestern University*
In this talk, I present work analyzing the rules and rule-making on
Wikipedia. The governance of many online communities relies on rules
created by participants. However, work predominantly focuses on efforts
within a single community or on a platform as a whole. Here we investigate
the comparative and relational dimensions of online self-governance in a
set of similar communities by looking at the five largest language editions
of Wikipedia. Using exhaustive trace data spanning almost 20 years since
their founding, we examine patterns in rule-making and overlaps in rule
sets. Our findings show that language editions have similar trajectories of
rule-making activity, replicating and extending a rich body of work that
has focused on English-language Wikipedia alone. We also find that the
language editions have increasingly unique rule sets, even as editing
activity concentrates on rules shared between them. The results suggest
that self-governing communities aligned in key ways may share a common core
of rules and rule-making practices even as they develop and sustain
institutional variations.
Wikipedia Community Policies and Experiential Epistemology: Critical
Information Literacy, Social Justice, and Inclusive Practices
By *Zachary J. McDowell, University of Illinois at Chicago*
Drawing from a meta-analysis of
research on learning outcomes in Wikipedia-based education, this
presentation addresses Wikipedia community policies and practices through
the Framework for Information Literacy in Higher Education from the
Association of College and Research Libraries (ACRL). Wikipedia-based
educational practices, which promote newcomers’ active engagement in the
encyclopedia, have been shown to support experiential learnings in critical
information literacy, communication and research outcomes, and social
justice. Exploring the connections between participation in Wikipedia and
transferable skills for information literacy in the context of the current
new media landscape, this presentation grapples with new questions for the
future of information literacies alongside the implications of large
language models (LLMs), systemic biases, and the representation and
inclusion of non-western and indigenous knowledge sources.
You can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Best,
Kinneret
--
Kinneret Gordon
Lead Research Community Officer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello,
After experiencing some strange behavior re-fetching pageview data, I am
wondering if it is possible that the daily pageview count for an article
could change *after* the data is originally published to the API.
For example, if I fetch the daily pageviews on an article for the date
14-08-23, and then re-fetch the daily pageviews for the same article in the
future, is it expected that the value for 14-08-23 could be different?
Is there a backfill or correction process that can update daily pageview
counts for days that are already available via the API?
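One way to confirm whether a published value has changed is to save each API response and diff the two. The sketch below is only an illustration, assuming each saved response has the per-article shape of the Wikimedia Pageviews REST API (an `items` list whose entries carry a `timestamp` and a `views` field). The helper names `daily_views` and `diff_daily_views` and the sample numbers are hypothetical and are not real pageview data.

```python
from typing import Dict, List, Tuple


def daily_views(response: dict) -> Dict[str, int]:
    """Map each item's timestamp to its view count."""
    return {item["timestamp"]: item["views"] for item in response.get("items", [])}


def diff_daily_views(earlier: dict, later: dict) -> List[Tuple[str, int, int]]:
    """Return (timestamp, old_views, new_views) for days present in both
    responses whose view count differs between the two fetches."""
    old, new = daily_views(earlier), daily_views(later)
    shared_days = sorted(old.keys() & new.keys())
    return [(ts, old[ts], new[ts]) for ts in shared_days if old[ts] != new[ts]]


# Hypothetical sample responses (made-up numbers, for illustration only).
first_fetch = {"items": [
    {"timestamp": "2023081400", "views": 120},
    {"timestamp": "2023081500", "views": 95},
]}
second_fetch = {"items": [
    {"timestamp": "2023081400", "views": 123},
    {"timestamp": "2023081500", "views": 95},
]}

changed = diff_daily_views(first_fetch, second_fetch)
for ts, before, after in changed:
    print(f"{ts}: {before} -> {after}")
```

Keeping the raw responses alongside the time of each fetch would make it possible to report exactly which days changed and when the change was observed.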
Any information is appreciated!
Thanks,
Duncan
--
Duncan Grubbs
Software Engineer
he/him
E: duncan(a)predata.com <first.lastname(a)predata.com>
Time Zone: ET (UTC-5/-4)
predata.com <https://www.predata.com/>