----- Original Message -----
From: Tomasz Finc <tfinc@wikimedia.org>
Date: Wednesday, April 14, 2010 1:22 am
Subject: Re: [Xmldatadumps-l] Changing lengths of full dump
To: Neil Harris <usenet@tonal.clara.co.uk>
Cc: xmldatadumps-l@lists.wikimedia.org

> Neil Harris wrote:
>
> ...
>
> > Thanks for letting me know.
> >
> > Since dumps appear to be made incrementally on top of other
> dumps, there
> > seems to be a real risk of errors being compounded on top of
> errors.
> > Does anyone here know if there have been any attempts to
> validate the
> > current enwiki full dump against the database? For example, by
> selecting
>
> Erik Zachte validated the 20100130 and found it intact and
> inclusive of
> all expected changes.
>
> If you do find any holes in 20100130 then please let us know.

Hi,

I watched the progress of the 20100130 enwiki pages-meta-history dump and noticed one potential problem.

There is a line on the enwiki dump progress page that gets updated as the dump progresses:

http://download.wikimedia.org/enwiki/20100130/

and the line looks like this:
2010-03-11 01:10:08: enwiki 19376810 pages (6.350/sec), 313797035 revs (102.827/sec), 89.7% prefetched, ETA 2010-03-14 04:35:01 [max 341714004]
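
For reference, the fields in that line can be pulled out with a short regex. This is only a sketch written against the sample line above; the real format may of course vary between dumps:

import re

# Sample status line from http://download.wikimedia.org/enwiki/20100130/
line = ("2010-03-11 01:10:08: enwiki 19376810 pages (6.350/sec), "
        "313797035 revs (102.827/sec), 89.7% prefetched, "
        "ETA 2010-03-14 04:35:01 [max 341714004]")

# Pattern guessed from the one sample line above.
pattern = re.compile(
    r"(?P<stamp>[\d-]+ [\d:]+): enwiki (?P<pages>\d+) pages .*?"
    r"(?P<revs>\d+) revs .*?\[max (?P<maxrevs>\d+)\]"
)

m = pattern.search(line)
if m:
    print(m.group("stamp"), m.group("pages"),
          m.group("revs"), m.group("maxrevs"))
    # -> 2010-03-11 01:10:08 19376810 313797035 341714004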

I watched the page count in that line increment and compared it to the growing size of the "pages-meta-history.xml.bz2" file. Extrapolating from that, I expected the finished dump to be about 320GB, but it came out at only 280.3GB. The file size grew more and more slowly as the page count neared the final total of 19376810 pages, which I did not expect: I had assumed the file size would grow roughly linearly with the page count, yet for about the last 10% of the pages the file hardly grew at all.

I now have a script that regularly polls the current dump status page, and I will use it to graph file size against page count (and similar measures) for the next enwiki pages-meta-history.xml.bz2 dump. The shape of the curves won't prove that a dump succeeded, but it can help show when something has gone wrong: the page count vs. file size curves for consecutive enwiki pages-meta-history.xml.bz2 dumps should be almost identical, so they should be useful for error checking.
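
For what it's worth, the polling side of that script is roughly along the following lines. This is just a sketch: the status URL is the one above, the .bz2 file name follows the usual enwiki-<date>- naming (an assumption on my part), and the CSV file name and 15-minute interval are arbitrary choices.

import csv
import re
import time
import urllib.request

STATUS_URL = "http://download.wikimedia.org/enwiki/20100130/"
# Assumed file name; adjust to whatever the in-progress dump is actually called.
DUMP_URL = STATUS_URL + "enwiki-20100130-pages-meta-history.xml.bz2"
PAGES_RE = re.compile(r"enwiki (\d+) pages")

def poll_once():
    # Pull the current page count out of the dump progress page.
    html = urllib.request.urlopen(STATUS_URL).read().decode("utf-8", "replace")
    m = PAGES_RE.search(html)
    pages = int(m.group(1)) if m else None

    # HEAD request to read the current size of the growing .bz2 file.
    req = urllib.request.Request(DUMP_URL, method="HEAD")
    size = urllib.request.urlopen(req).headers.get("Content-Length")
    return pages, int(size) if size else None

with open("dump_progress.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        pages, size = poll_once()
        writer.writerow([time.strftime("%Y-%m-%d %H:%M:%S"), pages, size])
        f.flush()
        time.sleep(15 * 60)  # poll every 15 minutes

The resulting CSV can then be plotted with page count on one axis and file size on the other, and the curves from consecutive dumps overlaid to see whether they track each other.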

cheers,
Jamie

>
> --tomasz
>
> _______________________________________________
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>