On Jan 28, 2008 7:31 PM, Felipe Ortega wrote:
If you read previous threads, this is the #1 broken feature request right now for researchers and other people interested in full dumps.
Thank you for responding. I have checked some previous threads and I see that full dumps (with history) for enwiki, dewiki and others have been a problem for some years now. I also saw that the dump server crashed in December, was fixed a few weeks later, and then died completely, so the machine had to be rebuilt.
However, this seems to be a different problem from the previous issues, because:
1) it is the without-history dump that has the problem (pages-meta-current, not pages-meta-history);
2) the dump appeared to have completed properly: the status page shows no error, and the md5 checksum was generated and matches the md5sum of the downloaded file.
To reinforce why I think this is a new problem: in this message http://lists.wikimedia.org/pipermail/wikitech-l/2007-November/034561.html David A. Desrosiers says (in regard to a question about a possibly corrupted enwiki-20071018-pages-meta-current.xml.bz2):
I have the whole process of fetch, unpack, import scripted to happen unattended and aside from initial debugging, it has not failed yet in the last year or more.
Anyway, to save people from spending time and bandwidth downloading 6 GB (or larger) files that then turn out to be corrupt and useless, I would like to ask whether the dump script could be changed to run an integrity check (bzip2 -t) on the file before updating the status to "done". The test takes only about 7 minutes on my computer for the enwiki pages-meta-current file; compared with the 46 hours it took to generate the dump in the first place, this should not add significantly to the time taken to produce dumps.
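For illustration, here is a minimal sketch of the kind of check I mean; I don't know how the dump scripts record status internally, so verify_bzip2 and mark_status below are placeholder names, not the actual functions:

    import subprocess
    import sys

    def verify_bzip2(path):
        """Return True if `bzip2 -t` confirms the archive is intact."""
        return subprocess.run(["bzip2", "-t", path]).returncode == 0

    def mark_status(path, status):
        # Stand-in for however the dump system updates the status page.
        print(f"{path}: {status}")

    if __name__ == "__main__":
        dump = sys.argv[1]
        if verify_bzip2(dump):
            mark_status(dump, "done")
        else:
            mark_status(dump, "failed")   # flag the corrupt file instead of publishing it
            sys.exit(1)

The point is simply that the status should only flip to "done" after bzip2 -t exits successfully; how that is wired into the existing scripts is up to whoever maintains them.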
Lev