The most recent enwiki dump seems corrupt (CRC failure when bunzipping). Another person (Nessus) has also noticed this, so it's not just me: http://meta.wikimedia.org/wiki/Talk:Data_dumps#Broken_image_.28enwiki-200801...
Steps to reproduce:
lsb32@cmt:~/enwiki> md5sum enwiki-20080103-pages-meta-current.xml.bz2
9aa19d3a871071f4895431f19d674650  enwiki-20080103-pages-meta-current.xml.bz2
lsb32@cmt:~/enwiki> bzip2 -tvv enwiki-20080103-pages-meta-current.xml.bz2 &> bunzip.log
lsb32@cmt:~/enwiki> tail bunzip.log
    [3490: huff+mtf rt+rld]
    [3491: huff+mtf rt+rld]
    [3492: huff+mtf rt+rld]
    [3493: huff+mtf rt+rld]
    [3494: huff+mtf rt+rld]
    [3495: huff+mtf
data integrity (CRC) error in data

You can use the `bzip2recover' program to attempt
to recover data from undamaged sections of corrupted files.

lsb32@cmt:~/enwiki> bzip2 -V
bzip2, a block-sorting file compressor.  Version 1.0.3, 15-Feb-2005.

Copyright (C) 1996-2005 by Julian Seward.

This program is free software; you can redistribute it and/or modify
it under the terms set out in the LICENSE file, which is included
in the bzip2-1.0 source distribution.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
LICENSE file for more details.

bzip2: I won't write compressed data to a terminal.
bzip2: For help, type: `bzip2 --help'.
lsb32@cmt:~/enwiki>
If you read previous threads, this is the #1 broken feature request right now for researchers and other people interested in full dumps.
The dump process has now been broken for more than a year, and the admins are overwhelmed with more urgent work, so it seems it will still take a long while to repair.
Other language editions are starting to fall into the same black hole too, as they grow in size.
Best,
Felipe.
_______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Jan 28, 2008 7:31 PM, Felipe Ortega wrote:
If you read previous threads, this is the #1 broken feature request right now for researchers and other people interested in full dumps.
Thank you for responding. I have checked some previous threads and I see that full dumps (with history) for enwiki, dewiki and others have been a problem for some years now. I also saw that the dump server crashed in December, got fixed a few weeks later, and then died completely, so the machine had to be rebuilt.
However, this seems to be a different problem from the previous issues, because:
1) this is the without-history dump that has the problem (pages-meta-current not pages-meta-history);
2) the dump appeared to have completed properly: the status page shows no mention of any error, and the md5 checksum was generated and matches the md5sum of the downloaded file.
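(A matching md5sum only proves the download was a faithful copy of whatever the server produced; it says nothing about whether the bzip2 stream inside is valid. A quick demonstration with a throwaway file rather than the real dump:)

```shell
# A file whose contents are not a valid bzip2 stream still has a
# perfectly good, reproducible md5sum; any faithful copy "verifies".
f=$(mktemp)
printf 'garbage, not a bzip2 stream' > "$f"
md5sum "$f"                          # checksum of the broken data
bzip2 -t "$f" 2>/dev/null || echo "checksum fine, stream still invalid"
rm -f "$f"
```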
To reinforce why I think this is a new problem: in this message http://lists.wikimedia.org/pipermail/wikitech-l/2007-November/034561.html David A. Desrosiers says (regarding a question about a possibly corrupted enwiki-20071018-pages-meta-current.xml.bz2):
I have the whole process of fetch, unpack, import scripted to happen unattended and aside from initial debugging, it has not failed yet in the last year or more.
Anyway, to save people from spending time and bandwidth downloading 6 GB (or larger) files that then turn out to be corrupt and useless, I would like to request that the dump script be changed to run an integrity check (bzip2 -t) on the file before updating the status to "done". The test takes only about 7 minutes on my computer for the enwiki pages-meta-current file; compared with the 46 hours it took to generate the dump in the first place, this would not add significantly to the time taken.
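A minimal sketch of such a gate, using bzip2's built-in CRC test (check_dump is a hypothetical helper; the real dump script's status-update step would go where the first echo is):

```shell
# Hypothetical integrity gate for the dump script: only mark a dump
# "done" if `bzip2 -t` passes; otherwise flag it for regeneration.
check_dump() {
    if bzip2 -t "$1" 2>/dev/null; then
        echo "OK: $1"             # safe to publish md5sum and mark "done"
        return 0
    else
        echo "CORRUPT: $1" >&2    # leave status unset; regenerate
        return 1
    fi
}
```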
Lev
Lev Bishop wrote:
Anyway, to save people from spending time and bandwidth downloading 6GB (or larger) files, which then turn out to be corrupt and useless, I would like to request if the dump script could be changed to run an integrity check (bzip2 -t) on the file before updating the status to "done".
Could do that...
-- brion
Lev Bishop lev.bishop+wikitech@gmail.com wrote: 1) this is the without-history dump that has the problem (pages-meta-current not pages-meta-history);
Sorry, then I misunderstood which dump was affected.
Felipe.
Lev
Lev Bishop wrote:
The most recent enwiki dump seems corrupt (CRC failure when bunzipping). Another person (Nessus) has also noticed this, so it's not just me: http://meta.wikimedia.org/wiki/Talk:Data_dumps#Broken_image_.28enwiki-200801...
This file's been regenerated, BTW; now passes bzip2 integrity check.
-- brion vibber (brion @ wikimedia.org)
On Sat, Mar 1, 2008 at 2:31 AM, Brion Vibber brion@wikimedia.org wrote:
This file's been regenerated, BTW; now passes bzip2 integrity check.
What's the new md5sum? The old one is still in enwiki-20080103-md5sums.txt
Are there any other files that were regenerated?
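Once the txt file is updated, a downloaded file can be checked against it with something like the following (file names as used in this thread; verify_dump is a hypothetical wrapper around GNU md5sum's check mode):

```shell
# Verify one downloaded dump file against the published checksums list.
# The grep narrows the list to the single file actually downloaded, so
# md5sum -c does not complain about the other, absent dump files.
verify_dump() {   # usage: verify_dump <file> <md5sums.txt>
    grep "$1" "$2" | md5sum -c -
}
```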