Cross-post.
---------- Forwarded message ---------
From: Adam Baso <abaso(a)wikimedia.org>
Date: Fri, May 31, 2024 at 4:05 PM
Subject: (Possible breaking change) XML pages-articles dumps bug with
missing revision text for some records; fix in progress with schema change
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
As described on Phabricator, a bug [1] has surfaced in which the "pages-articles"
XML dumps on https://dumps.wikimedia.org/ contain incomplete records.
A possible fix has been identified, and it involves bumping the dump schema
version from version 0.10 to version 0.11 [2], which could be a breaking
change for some.
MORE DETAILS:
Because of this bug, a nontrivial number of <text> nodes representing
article text appear empty, like so:
<text bytes="123456789" />
A potential fix in T365155 [3] has been identified. Assuming further
testing looks good, XML dumps will be kicked off again starting next week
in order to restore the missing records as soon as possible. It will take a
while for new dumps to be generated, as it is a compute-intensive operation.
Progress will be reported at T365155, and new dumps will eventually
show up on dumps.wikimedia.org.
Although a number of pipelines may not notice the change associated with
the schema bump, if your dump ingestion tooling or use of Special:Export
relies on the specific shape of the XML at version 0.10 (e.g., because of
code generation tools), please examine the differences between version 0.10
and version 0.11. One notable change in version 0.11 is the addition of MCR
(Multi-Content Revisions) [4] fields.
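For pipelines that would rather fail fast than silently misparse after the
schema bump, one option is to read the schema version from the root element
before ingesting anything. The following is a minimal sketch, not part of the
dumps tooling: the function name detect_export_version is hypothetical, and it
assumes the root <mediawiki> element carries a version attribute and an export
namespace URI, as in the published 0.10 and 0.11 schemas.

```python
import io
import xml.etree.ElementTree as ET


def detect_export_version(stream):
    """Return (namespace_uri, version) read from the root element of an export.

    Hypothetical helper: only the root element's start tag is parsed, so this
    stays cheap even for very large dump files.
    """
    for _event, elem in ET.iterparse(stream, events=("start",)):
        # The first "start" event is the root element; its attributes are
        # already available at this point.
        tag = elem.tag
        namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
        return namespace, elem.get("version")
    return None, None


# Usage with an in-memory sample; a real dump would be opened as a binary file.
sample = io.BytesIO(
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" '
    b'version="0.11"><page/></mediawiki>'
)
namespace, version = detect_export_version(sample)
if version != "0.10":
    print(f"Export schema is {version} ({namespace}); check the parser.")
```

Tooling that hard-codes the 0.10 namespace URI in element lookups is the case
most likely to need attention, since the namespace changes with the version.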
Thank you for your patience while this issue is resolved.
-Adam
[1] https://phabricator.wikimedia.org/T365501
[2] https://www.mediawiki.org/xml/export-0.10.xsd and
https://www.mediawiki.org/xml/export-0.11.xsd
Schema version 0.11 has existed in MediaWiki for over 6 years, but
Wikimedia wikis have been using version 0.10.
[3] https://phabricator.wikimedia.org/T365155#9851025 and
https://phabricator.wikimedia.org/T365155#9851160
[4] https://www.mediawiki.org/wiki/Multi-Content_Revisions
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20240501 full revision history content run.
We are currently dumping 989 projects in total.
---------------------
Stats for huwiktionary on date 20240501
Total size of page content dump files for articles, current content only:
538,644,717
Total size of page content dump files for all pages, current content only:
761,139,853
Total size of page content dump files for all pages, all revisions:
4,623,785,423
---------------------
Stats for enwiki on date 20240501
Total size of page content dump files for articles, current content only:
99,831,963,975
Total size of page content dump files for all pages, current content only:
205,958,848,730
Total size of page content dump files for all pages, all revisions:
28,701,652,484,981
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Hello,
Looking at https://dumps.wikimedia.org/wikidatawiki/entities/
I see that the Wikidata dumps for this week (20240520) are missing.
Could you have a look and keep an eye on next Monday's generation (0527)?
Thank you,
JL
Hello.
The dump dewiki-20240520-pages-articles.xml contains many empty articles (96069 in ns 0).
The first one is <id>15</id>, the last one is <id>13102212</id>.
For ns=0, this is a new phenomenon (introduced after 2024-03-01).
Counting all pages in the dump, the number of affected pages grew a lot:
# grep -c ' <text bytes="[0-9]*" />' dewiki-20240520-pages-articles.xml
101259
# grep -c ' <text bytes="[0-9]*" />' dewiki-20240301-pages-articles.xml
129
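To break such counts down by namespace without loading the whole file, a
streaming parse is one possibility. This is a rough sketch under stated
assumptions, not a tested tool: the function names are hypothetical, each
<page> is assumed to carry an <ns> element before its <revision>, and a node
is counted only when it declares bytes greater than zero but has no content.
Unlike the grep pattern, it does not count nodes with bytes="0", so the
numbers can differ slightly.

```python
import io
from collections import Counter
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works for 0.10 and 0.11."""
    return tag.rsplit("}", 1)[-1]


def count_empty_text_by_ns(stream):
    """Count <text> nodes that declare bytes > 0 but are empty, per namespace."""
    counts = Counter()
    current_ns = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "ns":
            current_ns = (elem.text or "").strip()
        elif name == "text":
            declared = int(elem.get("bytes", "0"))
            # Assumption: nodes marked deleted="deleted" are intentionally
            # empty and should not be reported.
            if declared > 0 and not elem.text and elem.get("deleted") is None:
                counts[current_ns] += 1
        elif name == "page":
            current_ns = None
            elem.clear()  # release the page's children to limit memory use
    return counts


# Usage with a tiny in-memory sample; a real dump would be opened in binary mode.
sample = io.BytesIO(
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10">'
    b'<page><title>A</title><ns>0</ns><id>15</id>'
    b'<revision><id>1</id><text bytes="123" /></revision></page>'
    b'<page><title>B</title><ns>0</ns><id>16</id>'
    b'<revision><id>2</id><text bytes="5" xml:space="preserve">hello</text></revision></page>'
    b'<page><title>User:C</title><ns>2</ns><id>17</id>'
    b'<revision><id>3</id><text bytes="42" /></revision></page>'
    b'</mediawiki>'
)
for ns, n in sorted(count_empty_text_by_ns(sample).items()):
    print(f"ns {ns}: {n} empty <text> nodes")
```

For a compressed dump, the file object returned by bz2.open(path, "rb") can be
passed in place of the in-memory sample.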
Greetings
Sven