Greetings XML Dump users,
TL;DR: We are pausing the XML Dumps, effective immediately, to correct runtime errors that we suspect are causing bad data in the dumps. We are working on a fix.
Longer: Over the past couple of months, we have noticed a growing number of runtime errors coming from the process that generates the XML content Dumps. This process has always been subject to transient issues and may miss some revisions in some months, but the errors now recur so often that we suspect data corruption in recent dumps.

We typically start a full dump (i.e. all revisions for all pages) on the 1st of the month for all wikis, and then we start a partial dump (i.e. all current revisions for all pages) on the 20th of the month. Most of the October 1, 2024 full runs are complete, except for the French wiki and the Wikidata wiki, which have failed for this month. The last successful copies of the French and Wikidata wikis are from September. All of the partial runs for the 20th of November are complete as well. However, any of these recent dumps may have underlying data quality issues.

In the interest of not publishing potentially bad data, we have decided to pause the XML Dumps, effective for all future dumps from the date of this communication, until we find and fix the root cause of these errors. We acknowledge that many folks and downstream processes will be impacted, and we apologize for any inconvenience this may cause you.

We are prioritizing this work, and you can follow updates at https://phabricator.wikimedia.org/T377594 if you are interested. Feel free to open additional tickets if your use cases are affected, and please link them to the main ticket. Further, if you have the ability, we welcome data quality analysis of recent dumps, including reports of any issues you may have noticed in your use cases.