Cross-post.
---------- Forwarded message ---------
From: Adam Baso <abaso(a)wikimedia.org>
Date: Fri, May 31, 2024 at 4:05 PM
Subject: (Possible breaking change) XML pages-articles dumps bug with
missing revision text for some records; fix in progress with schema change
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
As described on Phabricator, a bug [1] surfaced whereby the "pages-articles"
XML dumps on https://dumps.wikimedia.org/ contain incomplete records.
A possible fix has been identified, and it involves bumping the dump schema
version from version 0.10 to version 0.11 [2], which could be a breaking
change for some.
MORE DETAILS:
Due to the bug, a nontrivial number of <text> nodes that should hold
article text appear empty, like so:
<text bytes="123456789" />
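For anyone wanting to check whether a downloaded dump is affected, here is a minimal sketch in Python. It assumes that an affected record is a <text> element with a positive "bytes" attribute and no text content, as in the example above. The function name find_empty_text_pages and the sample data are hypothetical, and the check is only meaningful for dumps that are expected to carry full revision text.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_empty_text_pages(source):
    """Yield (title, declared_bytes) for <text> nodes that declare bytes > 0
    but carry no content. Assumption: this is what an affected record looks like."""
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            declared = int(elem.get("bytes", "0"))
            if declared > 0 and not elem.text:
                yield title, declared
        elif name == "page":
            elem.clear()  # release the finished page's children
            title = None


# Hypothetical two-page sample; real dumps are far larger and usually compressed.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Good</title><revision><text bytes="5">Hello</text></revision></page>
  <page><title>Broken</title><revision><text bytes="123456789" /></revision></page>
</mediawiki>"""

print(list(find_empty_text_pages(io.BytesIO(sample))))
# → [('Broken', 123456789)]
```

Matching on local tag names rather than a fixed namespace means the same check should work whether the dump declares version 0.10 or 0.11.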
A potential fix in T365155 [3] has been identified. Assuming further
testing looks good, XML dumps will be kicked off again starting next week
in order to restore the missing records as soon as possible. It will take a
while for new dumps to be generated, as it is a compute-intensive operation.
More progress will be reported at T365155 and new dumps will eventually
show up on dumps.wikimedia.org .
Although a number of pipelines may not notice the change associated with
the schema bump, if your dump ingestion tooling or use of Special:Export
relies on the specific shape of the XML at version 0.10 (e.g., because of
code generation tools), please examine the differences between version 0.10
and version 0.11. One notable change in version 0.11 is the addition of MCR
[4] fields.
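If your tooling needs to branch on the schema version, one option is to read it from the root element before parsing further. The sketch below is an illustration only: it assumes the root element's namespace follows the pattern of the schema file names in [2] (for example http://www.mediawiki.org/xml/export-0.11/) and falls back to a "version" attribute on the root. The function name export_schema_version is hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET


def export_schema_version(source):
    """Return the schema version string found on the root element, or None.

    Assumption: the namespace looks like '.../xml/export-<version>/'.
    """
    for _event, elem in ET.iterparse(source, events=("start",)):
        # The first "start" event is always the root element.
        match = re.search(r"/xml/export-([0-9.]+)/?\}", elem.tag)
        return match.group(1) if match else elem.get("version")
    return None


# Hypothetical minimal documents standing in for real dump files.
old = b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10"/>'
new = b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" version="0.11"/>'

print(export_schema_version(io.BytesIO(old)))  # → 0.10
print(export_schema_version(io.BytesIO(new)))  # → 0.11
```

Only the root element is inspected, so the check stays cheap even on a very large dump.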
Thank you for your patience while this issue is resolved.
-Adam
[1]
https://phabricator.wikimedia.org/T365501
[2]
https://www.mediawiki.org/xml/export-0.10.xsd
and
https://www.mediawiki.org/xml/export-0.11.xsd
Schema version 0.11 has existed in MediaWiki for over 6 years, but
Wikimedia wikis have been using version 0.10.
[3]
https://phabricator.wikimedia.org/T365155#9851025
and
https://phabricator.wikimedia.org/T365155#9851160
[4]
https://www.mediawiki.org/wiki/Multi-Content_Revisions
Hello,
I would like to upgrade stat1008 from buster to bullseye this Thursday
at approximately 09:15 UTC.
The upgrade is expected to take up to an hour, during which time
stat1008 will be unavailable for use. Work in your home directories will
be left untouched, so the impact should be low, especially if you are
using conda environments
<https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Conda>.
If this maintenance window is likely to cause an issue for you, please
do let me know and I can look to reschedule the work. We will also be
available after the upgrade, in case you experience difficulties with
the upgraded operating system.
After the upgrade, stat1008 will have new SSH host fingerprints, so I
will update this page SSH_Fingerprints/stat1008.eqiad.wmnet
<https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/stat1008.eqiad.wm…>
and provide some more help to get you reconnected.
Kind regards,
Ben
--
*Ben Tullis* (he/him)
Senior Site Reliability Engineer
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi all,
The next Research Showcase will be live-streamed tomorrow, Wednesday, May
15, at 9:30 AM PDT / 16:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1715790600>. The theme for this showcase is
*Reader to Editor Pipeline*.
You are welcome to watch via the YouTube stream:
https://www.youtube.com/watch?v=G-8CbpcwGV8. As usual, you can join the
conversation in the YouTube chat as soon as the showcase goes live.
This month's presentations:
Journey Transitions
By *Mike Raish and Daisy Chen*
What kinds of events do
readers and editors identify as separating the stages of their relationship
with Wikipedia, and which of these kinds of events might the Wikimedia
Foundation possibly support through design interventions? In the Journey
Transitions qualitative research project, the WMF Design Research team
interviewed readers and editors in Arabic, Spanish, and English in order to
answer these questions and provide guidance to WMF Product teams making
strategic decisions. A series of semi-structured interviews revealed that
readers and editors describe their relationships with Wikipedia in
different ways, with readers describing a static and transactional
relationship, and that even many experienced editors express confusion
about core functions of the Wikimedia ecosystem, such as the role of Talk
pages. This presentation will describe the Journey Transitions research, as
well as present its implications for the sponsoring Product teams in order
to shed light on the way that qualitative research is used to inform
strategic decisions in the Wikimedia Foundation.
Increasing participation in peer production communities with the Growth
features
By *Morten Warncke-Wang and Kirsten Stoller*
For peer production
communities to be sustainable, they must attract and retain new
contributors. Studies have identified social and technical barriers to
entry and discovered some potential solutions, but these solutions have
typically focused on a single highly successful community, the English
Wikipedia, been tested in isolation, and rarely evaluated through
controlled experiments. In this talk, we show how the Wikimedia
Foundation’s Growth team collaborates with Wikipedia communities to develop
and experiment with new features to improve the newcomer experience in
Wikipedia. We report findings from a large-scale controlled experiment
using the Newcomer Homepage, a central place where newcomers can learn how
peer production works and find opportunities to contribute, and show how
the effectiveness depends on the newcomer’s context. Lastly, we show how
the Growth team has continued developing features that further improve the
newcomer experience while adapting to community needs.
Best,
Kinneret
--
Kinneret Gordon
Lead Research Community Officer
Wikimedia Foundation <https://wikimediafoundation.org/>