Date: Wed, 17 Feb 2010 05:01:43 +0100
From: Tomasz Finc <tfinc@wikimedia.org>
Subject: Re: [Wikitech-l] enwiki complete page edit history
To: Wikimedia developers <wikitech-l@lists.wikimedia.org>
Message-ID: <4B7B6A27.9040200@wikimedia.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

It sadly failed as noted in
http://lists.wikimedia.org/pipermail/xmldatadumps-admin-l/2010-January/00007...
I've updated the index to clear that up.
--tomasz
Hi Tomasz,
The pages-meta-history.xml.bz2 is showing 115.4GB written (in progress) at: http://download.wikipedia.org/enwiki/20100130/
The older pages-meta-history.xml.bz2 from http://download.wikipedia.org/enwiki/20091128/ shows 255.1GB written (failed build)
So once the 20100130 current pages-meta-history.xml.bz2 dump is finished writing, will it be over 255GB as it is newer than the older copy and contains more info?
Also, I noticed these big files aren't weblinked for download lately. I think they should be, as they contain the full Wikipedia history/discussion pages, which have humongous amounts of useful information that should be available for easy distribution. What is the reason they aren't weblinked, the bandwidth costs?
cheers, Jamie
Jamie Morken wrote:
Hi,
I was looking at the enwiki dump progress and noticed that the file size for the enwiki pages-meta-history.xml.bz2 has decreased from 255GB on 20100125 down to 105GB on 20100203. Could the smaller archive file size mean that old page revision edit data is being lost?
2009-12-03 12:53:43 in-progress All pages with complete page edit history (.bz2)
2010-01-25 16:02:21: enwiki 14833408 pages (3.231/sec), 284292000 revs (61.930/sec), 54.7% prefetched, ETA 2010-02-03 02:34:19 [max 329446505]
These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
pages-meta-history.xml.bz2 255.1 GB (written)

2010-02-03 17:28:43 in-progress All pages with complete page edit history (.bz2)
2010-02-16 00:32:55: enwiki 747550 pages (0.704/sec), 95964000 revs (90.340/sec), 95.8% prefetched, ETA 2010-03-19 12:10:50 [max 341714004]
These dumps can be *very* large, uncompressing up to 20 times the archive download size. Suitable for archival and statistical use, most mirror sites won't want or need this.
pages-meta-history.xml.bz2 105.1 GB (written)

cheers, Jamie
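
For anyone who wants to sanity-check the ETA figures in the status lines quoted above, here is a minimal Python sketch. It rests on an assumption that is not stated anywhere in this thread: that the "[max N]" figure is the highest revision number the job expects to reach, and that the ETA is a straight-line extrapolation of the remaining revisions at the current revs/sec rate. The regular expression and the function name estimate_eta are made up for illustration and are not part of the dump tooling.

```python
import re
from datetime import datetime, timedelta

# Hypothetical pattern for the progress line quoted in this thread.
STATUS_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): \w+ "
    r"(?P<pages>\d+) pages \((?P<page_rate>[\d.]+)/sec\), "
    r"(?P<revs>\d+) revs \((?P<rev_rate>[\d.]+)/sec\), "
    r".*?\[max (?P<max>\d+)\]"
)

# Status line copied from the 2010-01-25 snapshot quoted above.
SAMPLE_LINE = (
    "2010-01-25 16:02:21: enwiki 14833408 pages (3.231/sec), "
    "284292000 revs (61.930/sec), 54.7% prefetched, "
    "ETA 2010-02-03 02:34:19 [max 329446505]"
)


def estimate_eta(line):
    """Extrapolate a finish time from one progress line.

    Assumption: remaining work = max - revs, processed at the
    reported revs/sec rate from the line's timestamp onward.
    """
    m = STATUS_RE.search(line)
    if m is None:
        raise ValueError("not a recognised progress line")
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
    remaining_revs = int(m.group("max")) - int(m.group("revs"))
    seconds_left = remaining_revs / float(m.group("rev_rate"))
    return ts + timedelta(seconds=seconds_left)


if __name__ == "__main__":
    print("estimated finish:", estimate_eta(SAMPLE_LINE))
```

Under that assumption, the estimate for both quoted status lines lands within a few seconds of the ETA each line reports. That is consistent with the 105.1 GB figure being a run that is still early in its work rather than a finished, smaller archive.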
> The pages-meta-history.xml.bz2 is showing 115.4GB written (in progress) at: http://download.wikipedia.org/enwiki/20100130/
> The older pages-meta-history.xml.bz2 from http://download.wikipedia.org/enwiki/20091128/ shows 255.1GB written (failed build)
> So once the 20100130 current pages-meta-history.xml.bz2 dump is finished writing, will it be over 255GB as it is newer than the older copy and contains more info?
Correct.
> Also, I noticed these big files aren't weblinked for download lately. I think they should be, as they contain the full Wikipedia history/discussion pages, which have humongous amounts of useful information that should be available for easy distribution. What is the reason they aren't weblinked, the bandwidth costs?
Do you mean that the failed runs aren't web linked? If so then I'd rather not point people to corrupted files.
--tomasz