Hi. I've noticed that some June XML data dumps have duplicate <page> records, usually at the end of the dump.
Anyone know if this is intentional? One or two duplicate records are benign, but I'm slightly concerned that it may be a symptom of a larger problem. I've been working with the XML data dumps for over 3 years and haven't seen this before.[1]
I list some examples below. They're only from the Swedish wikis and the Spanish Wikipedia (which is what I started looking at this week). Let me know if you need any other info, and I'll be happy to provide it.
Finally, for questions like these, is it best to email the mailing list, create a task in Phabricator or do both?
Thanks.
[1]: It may have started as recently as April 2015. I stopped looking at dumps shortly before the May problems with the dump server.
----
Example 1:
URL: http://dumps.wikimedia.org/svwikiversity/20150602/svwikiversity-20150602-pages-articles.xml.bz2
Title: Audi m8
ID: 18942
SHA1: gd16v3qkmjr2w2j35zhqitjfg97igjt
Note: Last article in dump. Repeated twice.

Example 2:
URL: http://dumps.wikimedia.org/svwikiquote/20150602/svwikiquote-20150602-pages-articles.xml.bz2
Title: Sommarens tolv månader
ID: 6209
SHA1: 9yibnev7pn3atxicayjoay0ave7pcu6
Note: Last article in dump. Repeated twice.

Example 3:
URL: http://dumps.wikimedia.org/svwikibooks/20150602/svwikibooks-20150602-pages-articles.xml.bz2
Title: Topologi/Metriska rum
ID: 10001
SHA1: 5zdkpxflzdxhy7gxclludnlasvl6tw3
Note: Last article in dump. Repeated twice.

Example 4:
URL: http://dumps.wikimedia.org/svwikisource/20150602/svwikisource-20150602-pages-articles.xml.bz2
Title: Afhandling om svenska stafsättet/4
ID: 88768
SHA1: 7zyj208ur4vit0t41z7xlftlyl69bo7
Note: Last article in dump. Repeated twice.

Example 5:
URL: http://dumps.wikimedia.org/eswiki/20150602/eswiki-20150602-pages-articles.xml.bz2
Title (1): Veguer
Title (2): Promo
Note: Duplicates are earlier in the dump (Veguer at the 9% mark, Promo at the 23% mark). There doesn't seem to be a duplicate at the end of the dump.
Unaffected:
* http://dumps.wikimedia.org/svwiki/20150602/svwiki-20150602-pages-articles.xml.bz2
* http://dumps.wikimedia.org/svwiktionary/20150603/svwiktionary-20150603-pages-articles.xml.bz2
* http://dumps.wikimedia.org/svwikinews/20150602/svwikinews-20150602-pages-articles.xml.bz2
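In case it helps, here is roughly the check I'm running. It's a minimal sketch rather than my exact code, and the helper name is just illustrative; it streams a pages-articles (or stub) dump and reports any <page> id that appears more than once.

    import bz2
    import gzip
    import sys
    import xml.etree.ElementTree as ET

    def find_duplicate_pages(path):
        """Stream a dump file and yield (id, title) for every <page> whose id repeats."""
        opener = gzip.open if path.endswith(".gz") else bz2.open
        seen = set()
        with opener(path, "rb") as dump:
            for _, elem in ET.iterparse(dump, events=("end",)):
                tag = elem.tag.rsplit("}", 1)[-1]  # strip the "{...export-x.y/}" prefix if present
                if tag != "page":
                    continue
                ns = elem.tag[:-len("page")]       # namespace prefix, or "" if the dump has none
                page_id = elem.findtext(ns + "id")
                title = elem.findtext(ns + "title")
                if page_id in seen:
                    yield page_id, title
                else:
                    seen.add(page_id)
                elem.clear()  # drop the page subtree so large dumps stay cheap to scan

    if __name__ == "__main__":
        for page_id, title in find_duplicate_pages(sys.argv[1]):
            print("duplicate page id %s: %s" % (page_id, title))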
On Sat, 04-07-2015, at 23:26 -0400, gnosygnu wrote:
Hi. I've noticed that some June XML data dumps have duplicate <page> records, usually at the end of the dump.
Anyone know if this is intentional? One or two duplicate records are benign, but I'm slightly concerned that it may be a symptom of a larger problem. I've been working with the XML data dumps for over 3 years and haven't seen this before.[1]
This was also reported by another user; see phab task T103670 for the report. Did you notice whether the stub dumps contain those same duplicate entries?
In any case, this is an error, and I need to make sure it is fixed for next month's run.
Ariel
Yup, they do show up in the stubs. I checked the four Swedish dumps. I left a comment there at https://phabricator.wikimedia.org/T103670#1432521
Let me know if there's anything else. Thanks!
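For what it's worth, the stub files are gzip- rather than bzip2-compressed but use the same <page>/<id> layout, so the sketch from my first message works on them unchanged. Something along these lines (the file name is just illustrative):

    # Hypothetical usage against one of the stub files; find_duplicate_pages()
    # is the sketch from my first message in this thread.
    for page_id, title in find_duplicate_pages("svwikiversity-20150602-stub-articles.xml.gz"):
        print("duplicate page id %s: %s" % (page_id, title))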
On Mon, Jul 6, 2015 at 1:22 PM, Ariel T. Glenn aglenn@wikimedia.org wrote:
This was also reported by another user; see phab task T103670 for the report. Did you notice whether the stub dumps contain those same duplicate entries?
In any case, this is an error, and I need to make sure it is fixed for next month's run.
Ariel