According to http://download.wikimedia.org/enwiki/20100130/ , the pages-meta-history.xml.bz2 file for that dump is 280.3 Gbytes in size.
In the http://download.wikimedia.org/enwiki/20100312/ dump, the corresponding file is only 178.7 Gbytes.
Is this the result of better compression, or has something gone wrong?
Kind regards,
Neil
--- On Tue, 13/4/10, Neil Harris usenet@tonal.clara.co.uk wrote:
From: Neil Harris usenet@tonal.clara.co.uk Subject: [Xmldatadumps-l] Changing lengths of full dump To: Xmldatadumps-l@lists.wikimedia.org Date: Tuesday, 13 April 2010 16:01 According to http://download.wikimedia.org/enwiki/20100130/ , the pages-meta-history.xml.bz2 file for that dump is 280.3 Gbytes in size.
In the http://download.wikimedia.org/enwiki/20100312/ dump, the corresponding file is only 178.7 Gbytes.
Is this the result of better compression, or has something gone wrong?
Hi Neil.
Some mails about this were just exchanged on this mailing list. Indeed, there was a problem in the generation of the latest dump.
Best, F.
Kind regards,
Neil
On 13/04/10 16:03, Felipe Ortega wrote:
--- On Tue, 13/4/10, Neil Harris usenet@tonal.clara.co.uk wrote:
From: Neil Harris usenet@tonal.clara.co.uk Subject: [Xmldatadumps-l] Changing lengths of full dump To: Xmldatadumps-l@lists.wikimedia.org Date: Tuesday, 13 April 2010 16:01 According to http://download.wikimedia.org/enwiki/20100130/ , the pages-meta-history.xml.bz2 file for that dump is 280.3 Gbytes in size.
In the http://download.wikimedia.org/enwiki/20100312/ dump, the corresponding file is only 178.7 Gbytes.
Is this the result of better compression, or has something gone wrong?
Hi Neil.
Some mails about this were just exchanged on this mailing list. Indeed, there was a problem in the generation of the latest dump.
Best, F.
Thanks for letting me know.
Since dumps appear to be made incrementally on top of other dumps, there seems to be a real risk of errors being compounded on top of errors. Does anyone here know if there have been any attempts to validate the current enwiki full dump against the database? For example, by selecting N revisions from the dump at random, and verifying that they exist in the DB, and vice versa for N revisions selected from the DB at random.
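For concreteness, a rough sketch of the dump-to-DB direction of such a spot-check might look like the following. This is only an illustration: the dump filename, sample size, and use of the standard api.php revids query are assumptions, and the DB-to-dump direction would need the sampled IDs drawn on the database side instead.

import bz2
import json
import random
import re
import urllib.request

# Sketch only: sample revision ids from a local dump copy and ask the live
# wiki (via the standard api.php revids query) whether it knows them.
# Filename and sample size are placeholders.
DUMP = "pages-meta-history.xml.bz2"
N = 50

def revision_ids(path):
    """Yield revision ids in dump order, streaming through the bz2 file."""
    in_revision = False
    for line in bz2.BZ2File(path):
        if b"<revision>" in line:
            in_revision = True
        elif in_revision:
            m = re.search(rb"<id>(\d+)</id>", line)
            if m:
                yield int(m.group(1))
                in_revision = False

def reservoir_sample(stream, n):
    """Uniform random sample of n items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < n:
            sample.append(item)
        else:
            j = random.randint(0, i)
            if j < n:
                sample[j] = item
    return sample

def missing_from_db(revids):
    """Return the sampled ids that the live database does not recognise."""
    url = ("https://en.wikipedia.org/w/api.php?action=query&format=json"
           "&revids=" + "|".join(str(r) for r in revids))
    req = urllib.request.Request(url, headers={"User-Agent": "dump-spot-check"})
    data = json.load(urllib.request.urlopen(req))
    bad = data.get("query", {}).get("badrevids", {})
    return sorted(r for r in revids if str(r) in bad)

sample = reservoir_sample(revision_ids(DUMP), N)
missing = missing_from_db(sample)
print("%d of %d sampled revisions missing from the DB" % (len(missing), len(sample)))

In practice, revisions that have since been deleted or suppressed would show up as false positives, so a database replica would be a better reference than the public API.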
Kind regards,
Neil
Neil Harris wrote:
...
Thanks for letting me know.
Since dumps appear to be made incrementally on top of other dumps, there seems to be a real risk of errors being compounded on top of errors. Does anyone here know if there have been any attempts to validate the current enwiki full dump against the database? For example, by selecting
Erik Zachte validated the 20100130 dump and found it intact and inclusive of all expected changes.
If you do find any holes in 20100130 then please let us know.
--tomasz
----- Original Message ----- From: Tomasz Finc tfinc@wikimedia.org Date: Wednesday, April 14, 2010 1:22 am Subject: Re: [Xmldatadumps-l] Changing lengths of full dump To: Neil Harris usenet@tonal.clara.co.uk Cc: xmldatadumps-l@lists.wikimedia.org
Neil Harris wrote:
...
Thanks for letting me know.
Since dumps appear to be made incrementally on top of other dumps, there seems to be a real risk of errors being compounded on top of errors. Does anyone here know if there have been any attempts to validate the current enwiki full dump against the database? For example, by selecting
Erik Zachte validated the 20100130 dump and found it intact and inclusive of all expected changes.
If you do find any holes in 20100130 then please let us know.
Hi,
I watched the 20100130 enwiki pages-meta-history dump progress and noticed one potential problem.
There is a line on the enwiki dump progress page that gets updated as the dump is progressing on this page
http://download.wikimedia.org/enwiki/20100130/
and the line looks like this: 2010-03-11 01:10:08: enwiki 19376810 pages (6.350/sec), 313797035 revs (102.827/sec), 89.7% prefetched, ETA 2010-03-14 04:35:01 [max 341714004]
I watched the page count from the above line increment and compared it to the growing file size of the "pages-meta-history.xml.bz2" dump, and extrapolated that the dump would be about 320GB by the time it was done, but it was only 280.3GB. I noticed that the file size grew more and more slowly as the page count neared the final total of 19376810 pages, which is unexpected: I thought the file size would grow fairly linearly with the page count, but for the last 10% or so of the pages the file size hardly grew at all.
I now have a script to regularly poll the current dump status page, and I will make some graphs of file size vs. page count etc. for the next enwiki pages-meta-history.xml.bz2 dump. The shapes of the curves won't show whether the dump is successful, but they can help show whether there is a problem: the page count vs. file size curves for consecutive "enwiki pages-meta-history.xml.bz2" dumps should be almost identical, so they should be useful for error checking.
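For what it's worth, the polling script is roughly along these lines (a sketch only: the URL layout, regexes, polling interval, and output file are guesses that would need adjusting to whatever the status page actually serves):

import csv
import re
import time
import urllib.request

# Sketch of the status-page poller: record timestamp, page count, revision
# count, and the reported size of pages-meta-history.xml.bz2 so the curves
# can be graphed later.  URL, interval, and regexes are assumptions.
STATUS_URL = "http://download.wikimedia.org/enwiki/20100130/"
POLL_SECONDS = 15 * 60
OUT = "enwiki-dump-progress.csv"

progress_re = re.compile(r"(\d+) pages \([\d.]+/sec\), (\d+) revs")
# Hypothetical: assumes the index page lists a size next to the filename.
size_re = re.compile(r"([\d.]+ [KMG]?B)[^\n]*pages-meta-history\.xml\.bz2")

with open(OUT, "a", newline="") as out:
    writer = csv.writer(out)
    while True:
        page = urllib.request.urlopen(STATUS_URL).read().decode("utf-8", "replace")
        pages = revs = size = None
        m = progress_re.search(page)
        if m:
            pages, revs = m.group(1), m.group(2)
        m = size_re.search(page)
        if m:
            size = m.group(1)
        writer.writerow([time.strftime("%Y-%m-%d %H:%M:%S"), pages, revs, size])
        out.flush()
        time.sleep(POLL_SECONDS)

Graphing the resulting CSV (page count on one axis, file size on the other) and overlaying consecutive dumps should make a stall like the one described above fairly obvious.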
cheers, Jamie
--tomasz