Cross-post.
---------- Forwarded message ---------
From: Adam Baso abaso@wikimedia.org
Date: Fri, May 31, 2024 at 4:05 PM
Subject: (Possible breaking change) XML pages-articles dumps bug with missing revision text for some records; fix in progress with schema change
To: Wikimedia developers wikitech-l@lists.wikimedia.org
As described on Phabricator, a bug [1] has surfaced in which the "pages-articles" XML dumps on https://dumps.wikimedia.org/ contain incomplete records.
A possible fix has been identified, and it involves bumping the dump schema version from version 0.10 to version 0.11 [2], which could be a breaking change for some.
MORE DETAILS:
Because of this bug, a nontrivial number of <text> nodes, which should contain article text, appear empty, like so:
<text bytes="123456789" />
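For anyone who wants to check a local copy for affected records, here is a minimal sketch in Python. The function and helper names are illustrative and not part of any Wikimedia tooling. It assumes an uncompressed "pages-articles" file (stub dumps legitimately carry empty <text> nodes) and treats any <text> node that declares a nonzero bytes value but has no content as suspect, skipping nodes marked with a deleted attribute.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix so the check works for 0.10 and 0.11."""
    return tag.rsplit("}", 1)[-1]


def find_empty_text_nodes(source):
    """Yield (page_title, declared_bytes) for <text> nodes that declare a
    nonzero size but carry no content.

    `source` is a filename or a binary file object (hypothetical input;
    decompress the dump first or pass e.g. a bz2.open(...) handle).
    """
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            declared = int(elem.get("bytes", "0"))
            # Assumption: suppressed revisions are marked deleted="deleted"
            # and are expected to be empty, so they are not reported.
            if declared > 0 and not elem.text and elem.get("deleted") is None:
                yield title, declared
        elif name == "page":
            # Free the memory held by the finished page subtree.
            elem.clear()
            title = None


if __name__ == "__main__":
    sample = (
        b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10">'
        b'<page><title>A</title><revision><text bytes="5">hello</text></revision></page>'
        b'<page><title>B</title><revision><text bytes="123456789" /></revision></page>'
        b"</mediawiki>"
    )
    for page_title, size in find_empty_text_nodes(io.BytesIO(sample)):
        print(page_title, size)
```

Matching on the local tag name rather than the full namespaced tag keeps the check usable on both sides of the schema bump.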
A potential fix in T365155 [3] has been identified. Assuming further testing looks good, XML dumps will be kicked off again starting next week in order to restore the missing records as soon as possible. It will take a while for new dumps to be generated, as this is a compute-intensive operation. More progress will be reported at T365155, and new dumps will eventually show up on dumps.wikimedia.org.
Although a number of pipelines may not notice the change associated with the schema bump, if your dump ingestion tooling or use of Special:Export relies on the specific shape of the XML at version 0.10 (e.g., because of code generation tools), please examine the differences between version 0.10 and version 0.11. One notable addition in version 0.11 is the MCR [4] fields.
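One possible way to guard a pipeline against an unexpected schema bump is to read the version attribute on the root <mediawiki> element before parsing any further. The sketch below assumes the dump declares its schema version there, as export files at 0.10 and 0.11 do; the function names and the SUPPORTED_VERSIONS set are hypothetical and should be adapted to your own tooling.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical: the schema versions this pipeline has been tested against.
SUPPORTED_VERSIONS = {"0.10"}


def dump_schema_version(source):
    """Return the `version` attribute of the root element, or None if absent.

    `source` is a filename or a binary file object. Only the opening tag of
    the root element is parsed before returning.
    """
    for _event, elem in ET.iterparse(source, events=("start",)):
        # The first "start" event is always the root element.
        return elem.get("version")
    return None


def ensure_supported(source):
    """Raise ValueError if the dump's schema version is not one we expect."""
    version = dump_schema_version(source)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unexpected export schema version: {version!r}")
    return version


if __name__ == "__main__":
    header = (
        b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" '
        b'version="0.11"><page></page></mediawiki>'
    )
    print(dump_schema_version(io.BytesIO(header)))
```

Failing fast on an unknown version is a design choice; a pipeline that is known to tolerate the additional fields could log a warning instead.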
Thank you for your patience while this issue is resolved.
-Adam
[1] https://phabricator.wikimedia.org/T365501
[2] https://www.mediawiki.org/xml/export-0.10.xsd
and
https://www.mediawiki.org/xml/export-0.11.xsd
Schema version 0.11 has existed in MediaWiki for over 6 years, but Wikimedia wikis have been using version 0.10.
[3] https://phabricator.wikimedia.org/T365155#9851025
and