----- Original Message -----
From: Tomasz Finc <tfinc(a)wikimedia.org>
Date: Wednesday, April 14, 2010 1:22 am
Subject: Re: [Xmldatadumps-l] Changing lengths of full dump
To: Neil Harris <usenet(a)tonal.clara.co.uk>
Cc: xmldatadumps-l(a)lists.wikimedia.org
Neil Harris wrote:
...
Thanks for letting me know.
Since dumps appear to be made incrementally on top of other dumps, there seems to be a
real risk of errors being compounded on top of errors.
Does anyone here know if there have been any attempts to validate the current enwiki full
dump against the database? For example, by selecting
Erik Zachte validated the 20100130 dump and found it intact and inclusive of all expected
changes.
If you do find any holes in 20100130 then please let us know.
Hi,
I watched the 20100130 enwiki pages-meta-history dump progress and noticed one potential
problem.
The enwiki dump progress page at
http://download.wikimedia.org/enwiki/20100130/
has a status line that is updated as the dump progresses. It looks like this:
2010-03-11 01:10:08: enwiki 19376810 pages (6.350/sec), 313797035 revs
(102.827/sec), 89.7% prefetched, ETA 2010-03-14 04:35:01 [max 341714004]
I watched the page count in that line increase and compared it with the growing file size
of the "pages-meta-history.xml.bz2" dump. From this I extrapolated that the dump would be
about 320GB when finished, but it came to only 280.3GB. The file grew more and more slowly
as the page count neared the final total of 19376810 pages. That was unexpected: I thought
the file size would grow fairly linearly with the page count, but for the last 10% or so
of the pages the file size hardly grew at all.
I now have a script that regularly polls the current dump status page, and I will make
some graphs of file size vs. page count etc. for the next enwiki
pages-meta-history.xml.bz2 dump. The curves won't show whether a dump succeeded, but they
can help show whether there is a problem in it. The page count vs. file size curves for
consecutive "enwiki pages-meta-history.xml.bz2" dumps should be almost identical, so
comparing them should be useful for error checking.
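For anyone who wants to try the same check, here is a minimal sketch in Python of the
parsing and extrapolation steps. It is not my actual polling script. The regular
expression is inferred from the single status line quoted above, and the function names
(parse_status, extrapolate_size, marginal_growth) are made up for illustration. Fetching
the page and reading the file size are left out, since how you obtain those depends on
your setup.

```python
import re

# Field layout inferred from the one status line quoted in this mail; other
# variants of the line may exist and would need a looser pattern.
STATUS_RE = re.compile(
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): (?P<wiki>\w+) "
    r"(?P<pages>\d+) pages \([\d.]+/sec\), "
    r"(?P<revs>\d+) revs \([\d.]+/sec\)"
)


def parse_status(text):
    """Return stamp, wiki, pages and revs from a status line, or None."""
    flat = " ".join(text.split())  # undo line wrapping before matching
    m = STATUS_RE.search(flat)
    if m is None:
        return None
    return {
        "stamp": m.group("stamp"),
        "wiki": m.group("wiki"),
        "pages": int(m.group("pages")),
        "revs": int(m.group("revs")),
    }


def extrapolate_size(pages_done, size_so_far, total_pages):
    """Linear estimate of the final size: assumes constant bytes per page."""
    if pages_done <= 0:
        raise ValueError("need at least one page written")
    return size_so_far * total_pages / pages_done


def marginal_growth(samples):
    """Size growth per page between consecutive (pages, size) samples."""
    out = []
    for (p0, s0), (p1, s1) in zip(samples, samples[1:]):
        if p1 > p0:  # skip polls where the page count did not move
            out.append((s1 - s0) / (p1 - p0))
    return out


# The status line exactly as quoted above, including the line wrap.
SAMPLE = (
    "2010-03-11 01:10:08: enwiki 19376810 pages (6.350/sec), 313797035 revs\n"
    "(102.827/sec), 89.7% prefetched, ETA 2010-03-14 04:35:01 [max 341714004]"
)

if __name__ == "__main__":
    print(parse_status(SAMPLE))
```

If the file size really grew linearly with the page count, the values returned by
marginal_growth would stay roughly constant. A steady drop towards the end of the dump is
the pattern I described above.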
cheers,
Jamie
--tomasz
_______________________________________________
Xmldatadumps-l mailing list
Xmldatadumps-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l