The latest enwiki articles dump, enwiki-latest-pages-articles.xml.bz2 in http://dumps.wikimedia.org/enwiki/latest/, is only 5.8 GB. Previous versions, e.g. http://dumps.wikimedia.org/enwiki/20110526/ and http://dumps.wikimedia.org/enwiki/20110405/, have been consistently around 6.7-6.8 GB.
I saw this after noticing that many pages are missing from the newest dump, e.g. http://en.wikipedia.org/wiki/Liar_Liar and http://en.wikipedia.org/wiki/Juan_que_re%C3%ADa.
Is this a known problem? Can anything be done to prevent this in the future?
Thanks, Eric
Yes, it's a known problem. You should be able to download the pieces instead. And yes, code is being tested to detect truncated files and flag them. In the meantime I have to do some other testing to see whether we're running into some constraint from running this many jobs at once, which causes the bzip2 processes to die off or be killed off.
Ariel
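
For anyone who wants to check a downloaded file locally in the meantime, here is a minimal sketch of one way to spot a truncated bzip2 file. It is not the detection code mentioned above; the function name `is_truncated_bz2` and the chunk size are made up for illustration. The idea is to stream the file through Python's standard `bz2` decompressor and confirm that the last compressed stream reaches its end-of-stream marker.

```python
import bz2


def is_truncated_bz2(path, chunk_size=1 << 20):
    """Return True if the bzip2 file at `path` looks truncated or corrupt.

    Hypothetical helper: it decompresses the whole file, discarding the
    output, and checks that every stream (the file may contain several
    concatenated streams) ends with a proper end-of-stream marker.
    """
    decomp = bz2.BZ2Decompressor()
    mid_stream = False      # True while a stream has started but not ended
    streams_finished = 0    # number of complete streams seen so far

    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            while chunk:
                try:
                    decomp.decompress(chunk)  # output is thrown away
                except OSError:
                    # Invalid data: treat as damaged.
                    return True
                if decomp.eof:
                    # This stream ended cleanly; any leftover bytes
                    # belong to the next concatenated stream.
                    streams_finished += 1
                    chunk = decomp.unused_data
                    decomp = bz2.BZ2Decompressor()
                    mid_stream = False
                else:
                    mid_stream = True
                    chunk = b""

    # Truncated if the input ran out mid-stream, or held no stream at all.
    return mid_stream or streams_finished == 0
```

Calling `is_truncated_bz2("enwiki-latest-pages-articles.xml.bz2")` reads the entire file, so expect it to take a while on a multi-gigabyte dump. Note that this sketch also flags files with trailing garbage after the last stream, which may or may not be what you want.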
On Tue, 05-07-2011 at 14:09 -0700, Eric Sun wrote:
The latest enwiki pages dump of enwiki-latest-pages-articles.xml.bz2 in http://dumps.wikimedia.org/enwiki/latest/ is only 5.8 GB. Previous versions, e.g. http://dumps.wikimedia.org/enwiki/20110526/ and http://dumps.wikimedia.org/enwiki/20110405/ have been consistently around 6.7-6.8GB.
I saw this after noticing that many pages are missing from the newest dump, e.g. http://en.wikipedia.org/wiki/Liar_Liar and http://en.wikipedia.org/wiki/Juan_que_re%C3%ADa.
Is this a known problem? Can anything be done to prevent this in the future?
Thanks, Eric
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l