Greetings XML Dump users,
TL;DR: We are pausing the XML Dumps effective from now to correct runtime errors that we suspect are causing bad data dumps. We are working on a fix.
Longer: Over the past couple of months, we have noticed a growing number of runtime errors coming from the process that generates the XML content dumps. It has always been the case that this process may have transient issues and may miss some revisions some months, but the current situation has become such a recurring problem that we now suspect data corruption in recent dumps.
We typically start a full dump (i.e. all revisions for all pages) on the 1st of the month for all wikis, and then we start a partial dump (i.e. all current revisions for all pages) on the 20th of the month. Most of the October 1, 2024 full runs are complete, except the French wiki and Wikidata wiki, which have failed for this month. The last successful copies of the French and Wikidata wikis are from September. All of the partial runs for the 20th of November are complete as well. However, any of these recent dumps may have underlying data quality issues.
In the interest of not dumping potentially bad data, we have decided to pause the XML Dumps, effective for all future dumps from the date of this communication, until we find and fix the root cause of these errors. We acknowledge that many folks and downstream processes will be impacted, and we apologize for any inconvenience that this may cause you.
We are prioritizing this work, and if interested, you can follow updates at https://phabricator.wikimedia.org/T377594. Feel free to open additional tickets if your use cases are affected, and please link them to the main ticket. Further, if you have the ability, we welcome any data quality analysis of recent dumps, including issues that you may have noticed in your use cases.
On Wed, 30 Oct 2024 at 03:27, ahoelzl@wikimedia.org wrote:
We typically start a full dump (i.e. all revisions for all pages) on the 1st of the month for all wikis, and then we start a partial dump (i.e all current revisions for all pages) on the 20th of the month. Most of the October 1 2024 full runs are complete, except the French wiki and Wikidata wiki, which have failed for this month. The last successful copies of the French and Wikidata wikis are from September. All of the partial runs for the 20th of November are complete as well. However, any of these recent dumps may have underlying data quality issues. In the interest of not dumping potentially bad data, we have decided to pause the XML Dumps, effective for all future dumps from the date of this communication, until we find and fix the root cause of these errors.
I guess "All of the partial runs for the 20th of November are complete as well" should say October, not November.
I notice commonswiki dump shows as aborted. Is this related to the above errors?
Cheers
On Wed, Oct 30, 2024 at 10:37 PM Platonides platonides@gmail.com wrote:
I guess "All of the partial runs for the 20th of November are complete as well" should say October, not November.
Yes, you are right. *October*.
I notice commonswiki dump shows as aborted. Is this related to the above errors?
It is related. We decided to stop all XML dump processes, and commonswiki happened to still be running. Thus the statement above should have read: "All of the partial runs for the 20th of *October* are complete as well*, except for commonswiki, which was aborted.*"
Cheers
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org To unsubscribe send an email to xmldatadumps-l-leave@lists.wikimedia.org
We are happy to announce that the dumps pipeline has been resumed as of November 8th. The November dumps (October data) are running with the fix on a schedule delayed by about a week. Most dumps are complete with wikidatawiki still being in progress (Nov 14th). Logs show that the fix is working well and we believe the new dumps to be accurate. With that, we consider the current incident resolved and dumps should be provided on the regular cadence going forward. https://phabricator.wikimedia.org/T377594
Thanks for your hard work fixing this!
On Thu, 14 Nov 2024 at 22:46, ahoelzl@wikimedia.org wrote:
We are happy to announce that the dumps pipeline has been resumed as of November 8th. The November dumps (October data) are running with the fix on a schedule delayed by about a week. Most dumps are complete with wikidatawiki still being in progress (Nov 14th). Logs show that the fix is working well and we believe the new dumps to be accurate. With that, we consider the current incident resolved and dumps should be provided on the regular cadence going forward. https://phabricator.wikimedia.org/T377594