The current dump
enwiki-20160204-pages-articles.xml.bz2
contains duplicate pages. In particular, "Total Nonstop Action" and "Ida de Grey" appear twice.
Is this going to be fixed, or should we assume that there may be duplicate pages in the dump? This has never happened to us before.
Ciao,
seba
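(For anyone who wants to reproduce the report: the following is a minimal sketch, not an official tool, that streams the bz2 dump and counts page ids. It assumes the usual dump layout, where the first <id> element after each <page> opening tag is the page id, and it parses line by line rather than with a full XML parser.)

import bz2
import re
from collections import Counter

PAGE_ID_RE = re.compile(r"<id>(\d+)</id>")

def find_duplicate_page_ids(dump_path):
    """Return page ids that occur more than once in a pages-articles dump."""
    counts = Counter()
    awaiting_id = False
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            if "<page>" in line:
                awaiting_id = True
            elif awaiting_id:
                m = PAGE_ID_RE.search(line)
                if m:
                    # The first <id> after <page> is the page id; later <id>
                    # tags belong to revisions and contributors.
                    counts[m.group(1)] += 1
                    awaiting_id = False
    return sorted(pid for pid, n in counts.items() if n > 1)

print(find_duplicate_page_ids("enwiki-20160204-pages-articles.xml.bz2"))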
I will investigate this. Tracked at https://phabricator.wikimedia.org/T127832
Thank you for reporting.
Ariel
I've seen no progress recently on the issue. Should we assume there will be duplicates?
Ciao,
seba
I've been trying to get the new hardware out for the monthly run. I'll be looking at this today and tomorrow to verify that the issue really is that separate page ranges are dumped for the same wiki without the database being frozen across the entire run. If that's indeed the case, it's not fixable until we revisit the db backend, potentially a big job.
Ariel
I've double-checked that the duplicate pages are in fact in separate stub jobs and have updated the Phabricator task accordingly. As this would be a lot of work to address with the current architecture, and is on the drawing board for the Dumps 2.0 rewrite, I'm deferring the issue until then. In the meantime, scripts or processors that work with multi-stub dumps should be prepared to filter out such duplicates, though they should be rare.
Ariel
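(A hedged sketch of the kind of filtering suggested above: stream the dump and drop any <page> block whose page id has already been seen, keeping the first occurrence. The file names and the line-based parsing are assumptions for illustration, not an official recipe.)

import bz2
import re

PAGE_ID_RE = re.compile(r"<id>(\d+)</id>")

def filter_duplicate_pages(src_path, dst_path):
    """Copy a dump, writing each <page> block only the first time its id appears."""
    seen = set()
    buffer = []      # lines of the <page> block currently being read
    page_id = None
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
         bz2.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if "<page>" in line:
                buffer, page_id = [line], None
            elif buffer:
                buffer.append(line)
                if page_id is None:
                    m = PAGE_ID_RE.search(line)
                    if m:
                        page_id = m.group(1)   # first <id> is the page id
                if "</page>" in line:
                    if page_id not in seen:
                        seen.add(page_id)
                        dst.writelines(buffer)
                    buffer = []
            else:
                # Header and footer lines outside <page> blocks pass through.
                dst.write(line)

filter_duplicate_pages("enwiki-20160204-pages-articles.xml.bz2",
                       "enwiki-20160204-deduped.xml.bz2")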