On Saturday, 4 July 2015, at 23:26 -0400, gnosygnu wrote:
Hi. I've noticed that some June XML data dumps have duplicate <page>
records, usually at the end of the dump.
Anyone know if this is intentional? One or two duplicate records is
benign, but I'm slightly concerned that it may be a symptom of a
larger problem. I've been working with the XML data dumps for over 3
years, and haven't seen this before.[1]
This was also reported by another user; see Phabricator task T103670
for the report. Did you notice whether the stub dumps contain the same
duplicate entries?
In any case this is an error, and I need to make sure it is fixed for
next month's run.
Ariel
I list some examples below. They're only from the Swedish wikis and
Spanish Wikipedia (which is what I started looking at this week). Let
me know if you need any other info, and I'll be happy to provide it.
Finally, for questions like these, is it best to email the mailing
list, create a task in Phabricator or do both?
Thanks.
[1]: It may have started as recently as 2015 April. I stopped looking
at dumps shortly before the May problems with the dump server.
----
Example 1:
URL: http://dumps.wikimedia.org/svwikiversity/20150602/svwikiversity-20150602-pages-articles.xml.bz2
Title: Audi m8
ID: 18942
SHA1: gd16v3qkmjr2w2j35zhqitjfg97igjt
Note: Last article in dump. Repeated twice
Example 2:
URL: http://dumps.wikimedia.org/svwikiquote/20150602/svwikiquote-20150602-pages-articles.xml.bz2
Title: Sommarens tolv månader
ID: 6209
SHA1: 9yibnev7pn3atxicayjoay0ave7pcu6
Note: Last article in dump. Repeated twice
Example 3:
URL: http://dumps.wikimedia.org/svwikibooks/20150602/svwikibooks-20150602-pages-articles.xml.bz2
Title: Topologi/Metriska rum
ID: 10001
SHA1: 5zdkpxflzdxhy7gxclludnlasvl6tw3
Note: Last article in dump. Repeated twice
Example 4:
URL: http://dumps.wikimedia.org/svwikisource/20150602/svwikisource-20150602-pages-articles.xml.bz2
Title: Afhandling om svenska stafsättet/4
ID: 88768
SHA1: 7zyj208ur4vit0t41z7xlftlyl69bo7
Note: Last article in dump. Repeated twice
Example 5:
URL: http://dumps.wikimedia.org/eswiki/20150602/eswiki-20150602-pages-articles.xml.bz2
Title (1): Veguer
Title (2): Promo
Note: The duplicates are earlier in the dump (Veguer at the 9% mark and
Promo at the 23% mark). There doesn't seem to be a dupe at the end of
the dump.
Unaffected:
* http://dumps.wikimedia.org/svwiki/20150602/svwiki-20150602-pages-articles.xml.bz2
* http://dumps.wikimedia.org/svwiktionary/20150603/svwiktionary-20150603-pages-articles.xml.bz2
* http://dumps.wikimedia.org/svwikinews/20150602/svwikinews-20150602-pages-articles.xml.bz2
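For anyone who wants to check a downloaded dump for repeated <page>
records, here is a minimal Python sketch. It is not the tool used to
produce the examples above. It assumes that two <page> elements with the
same <id> count as duplicates and that the set of page IDs fits in
memory; the function names find_duplicate_pages and report are made up
for illustration. The same check should also work on a stub dump, as
long as the file passed in is bz2-compressed.

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_duplicate_pages(stream):
    """Return {page_id: (title, count)} for every page ID seen more than once.

    `stream` is any binary file-like object holding uncompressed dump XML.
    """
    counts = Counter()
    titles = {}
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        page_id = title = None
        # Look at direct children only: revision IDs are nested one level
        # deeper, inside <revision>, and must not be mistaken for page IDs.
        for child in elem:
            name = local_name(child.tag)
            if name == "id":
                page_id = child.text
            elif name == "title":
                title = child.text
        if page_id is not None:
            counts[page_id] += 1
            titles[page_id] = title
        elem.clear()  # drop the page's text and children to limit memory use
    return {pid: (titles[pid], n) for pid, n in counts.items() if n > 1}


def report(path):
    """Print one line per duplicated page in a .xml.bz2 dump (hypothetical helper)."""
    with bz2.open(path, "rb") as dump:
        for pid, (title, n) in find_duplicate_pages(dump).items():
            print(f"{pid}\t{title}\tseen {n} times")
```

Calling report("svwikiquote-20150602-pages-articles.xml.bz2") on a local
copy would list any page ID that occurs more than once, along with its
title and the number of occurrences.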
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l