Cross-post.
---------- Forwarded message ---------
From: Adam Baso <abaso(a)wikimedia.org>
Date: Fri, May 31, 2024 at 4:05 PM
Subject: (Possible breaking change) XML pages-articles dumps bug with
missing revision text for some records; fix in progress with schema change
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
As described on Phabricator, a bug [1] surfaced whereby the "pages-articles"
XML dumps on https://dumps.wikimedia.org/ contain incomplete records.
A possible fix has been identified, and it involves bumping the dump schema
version from version 0.10 to version 0.11 [2], which could be a breaking
change for some.
MORE DETAILS:
Because of this bug, a nontrivial number of <text> nodes representing
article text appear empty, like so:
<text bytes="123456789" />
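For anyone wanting to check whether a dump they already downloaded is affected, the following is a minimal sketch, not an official tool. It assumes that an affected record is a <text> element with no content whose bytes attribute declares a nonzero length, as in the example above. The function names and the sample document are hypothetical; element names are matched without their namespace so the check does not depend on the schema version.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix so the check works for any schema version."""
    return tag.rsplit("}", 1)[-1]


def find_empty_text_pages(stream):
    """Yield (title, declared_bytes) for <text> nodes that are empty
    although their bytes attribute declares a nonzero length."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            declared = int(elem.get("bytes", "0"))
            if declared > 0 and not (elem.text or "").strip():
                yield title, declared
        elif name == "page":
            elem.clear()  # drop the finished page's children to limit memory use
            title = None


# Hypothetical two-page sample: one intact record and one affected record.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10">
  <page>
    <title>Intact</title>
    <revision><text bytes="5">Hello</text></revision>
  </page>
  <page>
    <title>Affected</title>
    <revision><text bytes="123456789" /></revision>
  </page>
</mediawiki>"""

affected = list(find_empty_text_pages(io.BytesIO(SAMPLE)))
print(affected)  # [('Affected', 123456789)]
```

For a real compressed dump, the stream could be opened with bz2.open(path, "rb") from the standard library, where path is whichever pages-articles file you downloaded.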
A potential fix in T365155 [3] has been identified. Assuming further
testing looks good, XML dumps will be kicked off again starting next week
in order to restore the missing records as soon as possible. It will take a
while for new dumps to be generated, as it is a compute-intensive operation.
More progress will be reported at T365155, and new dumps will eventually
show up on dumps.wikimedia.org.
Although a number of pipelines may not notice the change associated with
the schema bump, if your dump ingestion tooling or use of Special:Export
relies on the specific shape of the XML at version 0.10 (e.g., because of
code generation tools), please examine the differences between version 0.10
and version 0.11. One notable change in version 0.11 is the addition of MCR
[4] fields.
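One way a pipeline could guard against an unexpected schema bump is to read the version declared on the root element before parsing further. This is a minimal sketch under stated assumptions: it assumes the root <mediawiki> element carries a version attribute (as the published export schemas declare), and SUPPORTED_VERSIONS, dump_schema_version and check_version are hypothetical names, not part of any Wikimedia tooling.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: the versions this hypothetical pipeline has been tested against.
SUPPORTED_VERSIONS = {"0.10"}


def dump_schema_version(stream):
    """Return the 'version' attribute of the root element, or None if absent."""
    for _event, elem in ET.iterparse(stream, events=("start",)):
        return elem.get("version")  # the first start event is the root element
    return None


def check_version(stream):
    """Raise early if the dump declares a schema version we have not tested."""
    version = dump_schema_version(stream)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"untested dump schema version: {version!r}")
    return version


# Hypothetical minimal documents standing in for old and new dumps.
OLD_ROOT = b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10"></mediawiki>'
NEW_ROOT = b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" version="0.11"></mediawiki>'

print(dump_schema_version(io.BytesIO(OLD_ROOT)))  # 0.10
print(dump_schema_version(io.BytesIO(NEW_ROOT)))  # 0.11
```

Failing loudly on an untested version is a deliberate choice here: it surfaces the change at ingestion time rather than letting generated bindings silently drop the new fields.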
Thank you for your patience while this issue is resolved.
-Adam
[1] https://phabricator.wikimedia.org/T365501
[2] https://www.mediawiki.org/xml/export-0.10.xsd
    and
    https://www.mediawiki.org/xml/export-0.11.xsd
    Schema version 0.11 has existed in MediaWiki for over 6 years, but
    Wikimedia wikis have been using version 0.10.
[3] https://phabricator.wikimedia.org/T365155#9851025
    and
    https://phabricator.wikimedia.org/T365155#9851160
[4] https://www.mediawiki.org/wiki/Multi-Content_Revisions
Hello,
I would like to upgrade stat1008 from buster to bullseye this Thursday
at approximately 09:15 UTC.
The upgrade is expected to take up to an hour, during which time
stat1008 will be unavailable for use. Work in your home directories will
be left untouched, so the impact should be low, especially if you are
using conda environments
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda>.
If this maintenance window is likely to cause an issue for you, please
do let me know and I can look to reschedule the work. We will also be
available after the upgrade, in case you experience difficulties with
the upgraded operating system.
After the upgrade, stat1008 will have new SSH host fingerprints, so I
will update the page SSH_Fingerprints/stat1008.eqiad.wmnet
<https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/stat1008.eqiad.wm…>
and provide some more help to get you reconnected.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
The next Research Showcase will be live-streamed tomorrow, Wednesday, May
15, at 9:30 AM PDT / 16:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1715790600>. The theme for this showcase is
*Reader to Editor Pipeline*.
You are welcome to watch via the YouTube stream:
https://www.youtube.com/watch?v=G-8CbpcwGV8. As usual, you can join the
conversation in the YouTube chat as soon as the showcase goes live.
This month's presentations:
Journey Transitions
By *Mike Raish and Daisy Chen*
What kinds of events do
readers and editors identify as separating the stages of their relationship
with Wikipedia, and which of these kinds of events might the Wikimedia
Foundation possibly support through design interventions? In the Journey
Transitions qualitative research project, the WMF Design Research team
interviewed readers and editors in Arabic, Spanish, and English in order to
answer these questions and provide guidance to WMF Product teams making
strategic decisions. A series of semi-structured interviews revealed that
readers and editors describe their relationships with Wikipedia in
different ways, with readers describing a static and transactional
relationship, and that even many experienced editors express confusion
about core functions of the Wikimedia ecosystem, such as the role of Talk
pages. This presentation will describe the Journey Transitions research, as
well as present its implications for the sponsoring Product teams in order
to shed light on the way that qualitative research is used to inform
strategic decisions in the Wikimedia Foundation.
Increasing participation in peer production communities with the Growth features
By *Morten Warncke-Wang and Kirsten Stoller*
For peer production
communities to be sustainable, they must attract and retain new
contributors. Studies have identified social and technical barriers to
entry and discovered some potential solutions, but these solutions have
typically focused on a single highly successful community, the English
Wikipedia, been tested in isolation, and rarely evaluated through
controlled experiments. In this talk, we show how the Wikimedia
Foundation’s Growth team collaborates with Wikipedia communities to develop
and experiment with new features to improve the newcomer experience in
Wikipedia. We report findings from a large-scale controlled experiment
using the Newcomer Homepage, a central place where newcomers can learn how
peer production works and find opportunities to contribute, and show how
the effectiveness depends on the newcomer’s context. Lastly, we show how
the Growth team has continued developing features that further improve the
newcomer experience while adapting to community needs.
Best,
Kinneret
--
Kinneret Gordon
Lead Research Community Officer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello. I'm planning to shut down stat1010 tomorrow, to allow DC Ops to
connect power to its GPU card for T336040
<https://phabricator.wikimedia.org/T336040>. We tried to do this work a
couple of weeks ago, but it turned out that the cable had not arrived.
We're pretty confident about it this time.
I expect that it will be around 13:30 UTC and the outage for stat1010
will last up to 30 minutes.
If you can plan to use a different stat100* server while the work is
carried out, that would be very helpful.
On the other hand, if this planned maintenance will impact your work and
you can't work around it, please let me know and I will defer the power
down.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi everyone,
The next Research Showcase will be live-streamed tomorrow, Wednesday,
April 17, at 9:30 AM PDT / 16:30 UTC. Find your local time here. The
theme for this showcase is Supporting Multimedia on Wikipedia.
You are welcome to watch via the YouTube stream:
https://www.youtube.com/watch?v=wpSQD9Bc8Ek. As usual, you can join
the conversation in the YouTube chat as soon as the showcase goes
live.
This month's presentations:
Towards image accessibility solutions grounded in communicative principles
By Elisa Kreiss
Images have become an omnipresent communicative tool -- and this is no
exception on Wikipedia. However, the undeniable benefits they carry
for sighted communicators turn into a serious accessibility challenge
for people who are blind or have low vision (BLV). BLV users often
have to rely on textual descriptions of those images to participate
equally in an increasingly image-dominated online lifestyle. In
this talk, I will present how framing accessibility as a communication
problem highlights important ways forward in redefining image
accessibility on Wikipedia. I will present the Wikipedia-based dataset
Concadia and use it to discuss the successes and shortcomings of image
captions and alt texts for accessibility, and how the usefulness of
accessibility descriptions is fundamentally contextual. I will
conclude by highlighting the potential and risks of AI-based solutions
and discussing implications for different Wikipedia editing
communities.
Automatic Multi-Path Web Story Creation from a Structural Article
By Daniel Nkemelu
Web articles such as Wikipedia serve as one of the major sources of
knowledge dissemination and online learning. However, their in-depth
information--often in a dense text format--may not be suitable for
mobile browsing, even in a responsive user interface. We propose an
automatic approach that converts a structured article of any length
into a set of interactive Web Stories that are ideal for mobile
experiences. We focused on Wikipedia articles and developed
Wiki2Story, a pipeline based on language and layout models, to
demonstrate the concept. Wiki2Story dynamically slices an article and
plans one to multiple Story paths according to the document hierarchy.
For each slice, it generates a multi-page summary Story composed of
text and image pairs in visually appealing layouts. We derived design
principles from an analysis of manually created Story practices. We
executed our pipeline on 500 Wikipedia documents and conducted user
studies to review selected outputs. Results showed that Wiki2Story
effectively captured and presented salient content from the original
articles and sparked interest in viewers.
--
Kinneret Gordon
Lead Research Community Officer
Wikimedia Foundation
Hello. I'm planning to shut down stat1010 later today, to allow DC Ops
to connect power to its GPU card for T336040
<https://phabricator.wikimedia.org/T336040>.
The exact window will depend on when they are available, but I would
expect that it will be around 13:30 UTC and last up to 30 minutes.
If you can plan to use a different stat100* server while the work is
carried out, that would be very helpful.
On the other hand, if this planned maintenance will impact your work and
you can't work around it, please let me know and I will defer the power
down.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi everyone,
The next Research Showcase will be live-streamed on Wednesday, March 20, at
9:30 AM PDT / 16:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1710952200>. In line with Women's History
Month, the theme for this showcase is *Addressing Knowledge Gaps*.
You are welcome to watch via the YouTube stream:
https://www.youtube.com/watch?v=D6wrr9WShTk. As usual, you can join the
conversation in the YouTube chat as soon as the showcase goes live.
This month's presentations:
Leveraging Recommender Systems to Reduce Content Gaps on Wikipedia
By *Mo Houtti*
Many Wikipedians use algorithmic recommender systems to help them
find interesting articles to edit. The algorithms underlying those systems
are driven by a straightforward assumption: we can look at what someone
edited in the past to figure out what they’ll most likely want to edit
next. But the story of what Wikipedians want to edit is almost definitely
more complex than that. For example, our own prior research shows that
Wikipedians prefer prioritizing articles that would minimize content gaps.
So, we asked, what would happen if we incorporated that value into
Wikipedians’ personalized recommendations? Through a controlled experiment
on SuggestBot, we found that recommending more content gap articles didn’t
significantly impact editing, despite those articles being less “optimally
interesting” according to the recommendation algorithm. In this
presentation, I will describe our experiment, our results, and their
implications - including how recommender systems can be one useful strategy
for tackling content gaps on Wikipedia.
Bridging the offline and online - Offline meetings of Wikipedians
By *Nicole Schwitter*
Wikipedia is primarily
known as an online encyclopaedia, but it also features a noteworthy offline
component: Wikipedia and particularly its German-language edition – which
is one of the largest and most active language versions – is characterised
by regular local offline meetups which give editors the chance to get to
know each other. This talk will present the recently published dewiki
meetup dataset which covers (almost) all offline gatherings organised on
the German-language version of Wikipedia. The dataset covers almost 20
years of offline activity of the German-language Wikipedia, containing 4418
meetups that have been organised with information on attendees, apologies,
date and place of meeting, and minutes recorded. The talk will explain how
the dataset can be used for research, highlight the importance of
considering offline meetings among Wikipedians, and place these insights
within the context of addressing gender gaps within Wikipedia.
Best,
Kinneret
--
Kinneret Gordon
Lead Research Community Officer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello (especially to Superset users),
As you may know, the Data Platform SRE team is currently working on
migrating the Analytics Superset instances to Kubernetes (under ticket
T347710 <https://phabricator.wikimedia.org/T347710>) and, happily, I can
report that we are making good progress.
This is just a courtesy email to let you know that we plan to switch our
staging instance (superset-next.wikimedia.org
<https://superset-next.wikimedia.org>) over to Kubernetes within the
next day or two. This is unlikely to affect anyone's work at the moment,
given that both the staging and production instances of Superset have
been on version 3.1.0 for a while.
However, given that this staging instance is available for you to use at
any time, we thought it best to let you know that we are currently
working on it and that it may be in a state of flux for a while.
Once it is stable on Kubernetes, we may well contact you again and ask
you kindly to test superset-next for us and report your findings. At the
moment though, we're just working on the transition itself so there
won't be much for you to test.
As ever, if you have any queries or concerns, please don't hesitate to
let us know.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>