Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20241101 full revision history content run.
We are currently dumping 1008 projects in total.
---------------------
Stats for simplewiki on date 20241101
Total size of page content dump files for articles, current content only:
1,440,017,120
Total size of page content dump files for all pages, current content only:
2,270,831,092
Total size of page content dump files for all pages, all revisions:
70,225,134,087
---------------------
Stats for enwiki on date 20241101
Total size of page content dump files for articles, current content only:
104,825,619,918
Total size of page content dump files for all pages, current content only:
215,523,184,051
Total size of page content dump files for all pages, all revisions:
29,837,076,901,944
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Greetings XML Dump users,
TL;DR: We are pausing the XML Dumps, effective immediately, to correct runtime errors that we suspect are causing bad data dumps. We are working on a fix.
Longer:
Over the past couple of months, we have noticed a growing number of runtime errors coming from the process that generates the XML content Dumps. This process has always been prone to transient issues and may miss some revisions in some months, but the current situation has become such a recurring problem that we now suspect data corruption in recent dumps.
We typically start a full dump (i.e. all revisions for all pages) on the 1st of the month for all wikis, and then we start a partial dump (i.e. all current revisions for all pages) on the 20th of the month. Most of the October 1, 2024 full runs are complete, except for the French wiki and the Wikidata wiki, which have failed for this month. The last successful copies of the French and Wikidata wikis are from September. All of the partial runs for the 20th of November are complete as well. However, any of these recent dumps may have underlying data quality issues.
In the interest of not dumping potentially bad data, we have decided to pause the XML Dumps, effective for all future dumps from the date of this communication, until we find and fix the root cause of these errors.
We acknowledge that many folks and downstream processes will be impacted and apologize for any inconvenience that this may cause you.
We are prioritizing this work, and if you are interested, you can follow updates at https://phabricator.wikimedia.org/T377594. Feel free to open additional tickets if your use cases are affected, and please link them to the main ticket. Further, if you have the ability, we welcome data quality analysis of recent dumps, including any problems you may have noticed in your use cases.