Cross-post.
---------- Forwarded message ---------
From: Adam Baso <abaso(a)wikimedia.org>
Date: Fri, May 31, 2024 at 4:05 PM
Subject: (Possible breaking change) XML pages-articles dumps bug with
missing revision text for some records; fix in progress with schema change
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
As described on Phabricator, a bug [1] surfaced whereby the "pages-articles"
XML dumps on https://dumps.wikimedia.org/ contain incomplete records.
A possible fix has been identified, and it involves bumping the dump schema
version from version 0.10 to version 0.11 [2], which could be a breaking
change for some.
MORE DETAILS:
Due to the bug, a nontrivial number of <text> nodes
representing article text appear empty, like so:
<text bytes="123456789" />
A potential fix in T365155 [3] has been identified. Assuming further
testing looks good, XML dumps will be kicked off again starting next week
in order to restore the missing records as soon as possible. It will take a
while for new dumps to be generated, as this is a compute-intensive operation.
More progress will be reported at T365155 and new dumps will eventually
show up on dumps.wikimedia.org .
Although a number of pipelines may not notice the change associated with
the schema bump, if your dump ingestion tooling or use of Special:Export
relies on the specific shape of the XML at version 0.10 (e.g., because of
code generation tools), please examine the differences between version 0.10
and version 0.11. One notable change in version 0.11 is the addition of MCR
[4] fields.
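For pipelines that want to detect the bump before parsing a whole file, a minimal sketch (not an official tool; the function name `dump_schema_version` is illustrative, and it assumes the dump's root <mediawiki> element carries the usual export namespace) that streams just the root element and reports the schema version:

```python
# Sketch: read only the root element of a MediaWiki XML dump and report
# its export schema version (e.g. "0.10" or "0.11"), so ingestion tooling
# can branch or bail out before streaming gigabytes of pages.
import xml.etree.ElementTree as ET

def dump_schema_version(path):
    # iterparse yields the root <mediawiki> element first on a "start" event.
    for event, elem in ET.iterparse(path, events=("start",)):
        # Tag looks like "{http://www.mediawiki.org/xml/export-0.11/}mediawiki";
        # extract the namespace URI and take the trailing version component.
        ns = elem.tag.split("}")[0].lstrip("{")
        return ns.rstrip("/").rsplit("-", 1)[-1]
```

This only inspects the namespace URI, so it works on a partially downloaded file as well.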
Thank you for your patience while this issue is resolved.
-Adam
[1]
https://phabricator.wikimedia.org/T365501
[2]
https://www.mediawiki.org/xml/export-0.10.xsd
and
https://www.mediawiki.org/xml/export-0.11.xsd
Schema version 0.11 has existed in MediaWiki for over 6 years, but
Wikimedia wikis have been using version 0.10.
[3]
https://phabricator.wikimedia.org/T365155#9851025
and
https://phabricator.wikimedia.org/T365155#9851160
[4]
https://www.mediawiki.org/wiki/Multi-Content_Revisions
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20240501 full revision history content run.
We are currently dumping 989 projects in total.
---------------------
Stats for huwiktionary on date 20240501
Total size of page content dump files for articles, current content only:
538,644,717
Total size of page content dump files for all pages, current content only:
761,139,853
Total size of page content dump files for all pages, all revisions:
4,623,785,423
---------------------
Stats for enwiki on date 20240501
Total size of page content dump files for articles, current content only:
99,831,963,975
Total size of page content dump files for all pages, current content only:
205,958,848,730
Total size of page content dump files for all pages, all revisions:
28,701,652,484,981
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Hello,
Looking here https://dumps.wikimedia.org/wikidatawiki/entities/
I see the generation of Wikidata dumps is missing for this week
(20240520).
Could you have a look, and keep an eye on next Monday's generation (0527)?
Thank you,
JL
Hello.
The dump dewiki-20240520-pages-articles.xml contains many (96069 for ns 0) empty articles.
The first one is for <id>15</id>, the last one for <id>13102212</id>.
For ns=0, this is a new phenomenon (introduced after 2024-03-01).
Across all articles, the number of affected records grew considerably:
# grep -c ' <text bytes="[0-9]*" />' dewiki-20240520-pages-articles.xml
101259
# grep -c ' <text bytes="[0-9]*" />' dewiki-20240301-pages-articles.xml
129
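A rough, namespace-aware equivalent of the grep above (a sketch, not an official check; the function name `count_empty_text_nodes` is illustrative): stream the dump and count <text> elements that carry a bytes attribute but no content, i.e. the affected records.

```python
# Sketch: count empty <text bytes="..."/> nodes in a pages-articles dump
# without loading the whole file into memory.
import xml.etree.ElementTree as ET

def count_empty_text_nodes(path):
    empty = 0
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag.endswith("}text"):
            # An affected record declares a byte length but has no text body.
            if elem.get("bytes") is not None and not (elem.text or ""):
                empty += 1
        elif elem.tag.endswith("}page"):
            elem.clear()  # free finished pages so multi-GB dumps stay cheap
    return empty
```

Unlike the grep, this tolerates whitespace or attribute-order differences in how the empty node is serialized.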
Greetings
Sven
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20240401 full revision history content run.
We are currently dumping 982 projects in total.
---------------------
Stats for dkwikimedia on date 20240401
Total size of page content dump files for articles, current content only:
1,197,640
Total size of page content dump files for all pages, current content only:
2,454,891
Total size of page content dump files for all pages, all revisions:
106,011,011
---------------------
Stats for enwiki on date 20240401
Total size of page content dump files for articles, current content only:
99,343,047,221
Total size of page content dump files for all pages, current content only:
205,053,944,117
Total size of page content dump files for all pages, all revisions:
28,539,050,275,897
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Hello, I'm working on a little project to perform cryptographically-sound
timestamping on Wikipedia snapshots. I'm using the opentimestamps.org
service, which by default uses the SHA-256 hash. In order to get the
SHA-256 for the timestamp, I need to download each file and compute the
hash.
Currently the XML data dumps provide only MD5 and SHA-1 digests. Both of
these hash functions are obsolete because they are cryptographically
broken. I'm wondering: would the maintainers of this service be willing to
add SHA-256 digests to the dumpstatus and checksum files going forward?
SHA-256 is still cryptographically sound and would allow me to verify that
I have the correct hash for timestamping.
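Until SHA-256 digests ship in the checksum files, the hash has to be computed locally after download. A sketch (the function name `sha256_of_file` is illustrative) that hashes a dump in fixed-size chunks so multi-gigabyte files never need to fit in memory:

```python
# Sketch: compute the SHA-256 digest of a (possibly very large) dump file
# by streaming it in 1 MiB chunks.
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The resulting hex digest is what a timestamping service like opentimestamps.org would anchor.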
Thanks in advance!
Best regards,
Arthur
I was reading about incremental xml dumps on this page: https://dumps.wikimedia.org/other/incr/
While I understand this service is experimental and may stop working at any time, I was curious how frequent the incremental dumps are when the system is working properly. Also, how common is it for the incremental dumps to stop working?