On Jan 28, 2008 7:31 PM, Felipe Ortega wrote:
If you read previous threads, this is the #1 broken
feature request right now for researchers and other people interested in full dumps.
Thank you for responding. I have checked some previous threads and I
see that full dumps (with history) for enwiki, dewiki and others have
been a problem for some years now. I also saw that the dump server
crashed in December, was fixed a few weeks later, and then died
completely, so the machine had to be rebuilt.
However, this seems to be a different problem from the previous issues, because:
1) it is the without-history dump that has the problem
(pages-meta-current, not pages-meta-history);
2) the dump appeared to have completed properly: the status page
mentions no error, and the md5 checksum was generated and matches
the md5sum of the downloaded file.
To reinforce why I think this is a new problem, in this message
http://lists.wikimedia.org/pipermail/wikitech-l/2007-November/034561.html
David A. Desrosiers says (regarding a question about a possibly
corrupted enwiki-20071018-pages-meta-current.xml.bz2):
I have the whole process of fetch, unpack, import scripted to happen
unattended and aside from initial debugging, it has not failed yet in
the last year or more.
Anyway, to save people from spending time and bandwidth downloading
6GB (or larger) files, which then turn out to be corrupt and useless,
I would like to ask whether the dump script could be changed to run an
integrity check (bzip2 -t) on the file before updating the status to
"done". It takes only about 7 minutes on my computer to run this test
on the enwiki pages-meta-current file. Compared with the 46 hours
it took to generate the dump in the first place, this should not add
significantly to the time taken to generate dumps.
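
To make the request concrete, here is a rough sketch of the kind of check I
mean. It is not taken from the actual dump script, which I have not seen; the
names bz2_is_intact and final_status are made up for illustration, and it uses
Python's standard bz2 module to decompress the whole file and discard the
output, which is roughly what bzip2 -t does.

```python
import bz2


def bz2_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole file at `path` decompresses cleanly."""
    try:
        with bz2.open(path, "rb") as stream:
            # Read and discard the output; we only care whether decoding
            # reaches the end of the file without an error.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError):
        # OSError: invalid or corrupted data (or an unreadable file).
        # EOFError: the file ends before the end-of-stream marker.
        return False
    return True


def final_status(path):
    # Hypothetical hook: the status strings here are assumptions, not the
    # values the real status page uses.
    return "done" if bz2_is_intact(path) else "failed"
```

Anyone downloading a dump could run the same function on their copy before
starting an import, as a complement to comparing the md5sum.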
Lev