Well, that used up all my good luck for the year, but the bz2s are ready for download. The md5sums are still calculating, give them a couple hours to show up. If all continues to go well we'll have the 7z files in 4-5 days.
As before I do not plan to provide a single 350gb file of the bz2, nor a single 7z file for download.
Happy trails,
Ariel
Hi all,
Congrats Ariel! :) The pages-meta-history files for the last two enwiki dumps sum to 342.7GB for the 20110115 dump and 353.5GB for the 20110317 dump, so the overall dump size grew by about 10.8GB over 2 months. Seven of the individually numbered pages-meta-history files shrank while eight grew from 20110115 to 20110317. By far the biggest decrease was in the pages-meta-history10.xml.bz2 file, which dropped from 18.7GB to 1.9GB. I think there are probably missing revisions in that page ID range.
Here are some historical dump sizes for comparison, to show the growth of these files:
enwiki-20060816-pages-meta-history.xml.7z 5.08GB
enwiki-20070402-pages-meta-history.xml.7z 11.3GB (229 days since previous dump)
enwiki-20080103-pages-meta-history.xml.7z 17.2GB (276 days since previous dump)
enwiki-20100130-pages-meta-history.xml.7z 31.8GB (758 days since previous dump)
enwiki-20110115-pages-meta-history[1-15].xml.7z 38.0GB (350 days since previous dump)
enwiki-20110317-pages-meta-history[1-15].xml.7z (7z compression in progress)
Here's a graph of this data; the dump file size growth seems to be pretty linear:
(chart x-axis starts from 20060816 dump and ends at 20110115 dump)
"http://nekrom.com/wikipedia/enwiki%20history%20dump%20file%20size%20over%20time.png"
cheers,
Jamie
----- Original Message -----
From: "Ariel T. Glenn" ariel@wikimedia.org
Date: Tuesday, March 29, 2011 3:24 pm
Subject: [Xmldatadumps-l] March 17 en wikipedia history bz2 files ready
To: xmldatadumps-l@lists.wikimedia.org
Cc: wikitech-l@lists.wikimedia.org
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
On Tue, Mar 29, 2011 at 6:08 PM, Jamie Morken jmorken@shaw.ca wrote:
Hi all,
[...]
Here are some historical dump sizes for comparison, to show the growth of these files:
enwiki-20060816-pages-meta-history.xml.7z 5.08GB
enwiki-20070402-pages-meta-history.xml.7z 11.3GB (229 days since previous dump)
enwiki-20080103-pages-meta-history.xml.7z 17.2GB (276 days since previous dump)
enwiki-20100130-pages-meta-history.xml.7z 31.8GB (758 days since previous dump)
enwiki-20110115-pages-meta-history[1-15].xml.7z 38.0GB (350 days since previous dump)
enwiki-20110317-pages-meta-history[1-15].xml.7z (7z compression in progress)
[...]
cheers,
Jamie
According to this data the 7z dump for enwp will reach 1 terabyte on Jan 2, 2145.
=)
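For anyone who wants to check a projection like this, here is a minimal Python sketch that fits a least-squares line through the five 7z sizes listed earlier in the thread. It assumes 1 terabyte means 1024 GB and takes each dump date from its file name. The choice of an ordinary least-squares fit is also an assumption, since the thread does not say how the 2145 figure was obtained, and the function and variable names are made up for illustration.

```python
from datetime import date, timedelta

# (dump date, 7z size in GB) as posted earlier in the thread
DUMPS_7Z = [
    (date(2006, 8, 16), 5.08),
    (date(2007, 4, 2), 11.3),
    (date(2008, 1, 3), 17.2),
    (date(2010, 1, 30), 31.8),
    (date(2011, 1, 15), 38.0),
]


def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Use days since the first dump as the x-axis.
first_day = DUMPS_7Z[0][0]
points = [((d - first_day).days, size) for d, size in DUMPS_7Z]
slope_gb_per_day, intercept_gb = fit_line(points)

TARGET_GB = 1024  # assumption: 1 TB = 1024 GB
days_to_target = (TARGET_GB - intercept_gb) / slope_gb_per_day
projected = first_day + timedelta(days=round(days_to_target))

print(f"growth: {slope_gb_per_day:.4f} GB/day")
print(f"projected date for {TARGET_GB} GB: {projected.isoformat()}")
```

A straight line through five points says very little about the next century, so the exact date is best read in the same spirit as the smiley.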
----- Original Message -----
From: Brian J Mingus brian.mingus@Colorado.EDU
Date: Tuesday, March 29, 2011 7:15 pm
Subject: Re: [Wikitech-l] [Xmldatadumps-l] March 17 en wikipedia history bz2 files ready
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Cc: Jamie Morken jmorken@shaw.ca, "Ariel T. Glenn" ariel@wikimedia.org, xmldatadumps-l@lists.wikimedia.org
According to this data the 7z dump for enwp will reach 1 terabyte on Jan 2, 2145.
=)
wanna bet? :)
cheers, Jamie
-- Brian Mingus Graduate student Computational Cognitive Neuroscience Lab University of Colorado at Boulder
Hi,
I made a graph of the uncompressed XML file size for the enwiki pages-meta-history files over time. I thought these files would be growing exponentially, but they appear to grow linearly. For comparison, in 2145 the raw XML should be about 178 TB I think, so the raw XML is growing linearly about 180x faster than the 7z files.
"http://nekrom.com/wikipedia/enwiki%20history%20uncompressed%20XML%20dump%20f..."
(data below)
cheers, Jamie
enwiki-20060816-pages-meta-history.xml 782741875000 (728.99 GB)
enwiki-20070402-pages-meta-history.xml 1763048493749 (1641.97 GB) (229 days since previous dump)
enwiki-20080103-pages-meta-history.xml 2807444044080 (2614.64 GB) (276 days since previous dump)
enwiki-20100130-pages-meta-history.xml 5873134833455 (5469.78 GB) (758 days since previous dump)
enwiki-20110115-pages-meta-history[1-15].xml 7218617857754 (6722.86 GB) (350 days since previous dump)
enwiki-20110115-pages-meta-history1.xml 1 080 719 385 129
enwiki-20110115-pages-meta-history2.xml 677 956 948 289
enwiki-20110115-pages-meta-history3.xml 550 889 319 423
enwiki-20110115-pages-meta-history4.xml 447 001 611 247
enwiki-20110115-pages-meta-history5.xml 453 700 983 270
enwiki-20110115-pages-meta-history6.xml 540 208 590 115
enwiki-20110115-pages-meta-history7.xml 458 817 000 243
enwiki-20110115-pages-meta-history8.xml 649 710 293 818
enwiki-20110115-pages-meta-history9.xml 471 183 250 318
enwiki-20110115-pages-meta-history10.xml 406 115 459 739
enwiki-20110115-pages-meta-history11.xml 342 840 308 580
enwiki-20110115-pages-meta-history12.xml 310 507 626 798
enwiki-20110115-pages-meta-history13.xml 362 264 384 002
enwiki-20110115-pages-meta-history14.xml 269 988 897 698
enwiki-20110115-pages-meta-history15.xml 196 713 799 085
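As a quick sanity check on these figures, the Python sketch below sums the fifteen per-piece byte counts and compares the average daily growth of the raw XML with that of the 7z files. It uses only the first and last dumps (1613 days apart, by the day counts given in the thread) and treats a GB as 2^30 bytes, which is what the posted figures imply. The variable names are illustrative only.

```python
# Uncompressed bytes per piece of enwiki-20110115-pages-meta-history[1-15].xml
PIECE_BYTES = [
    1_080_719_385_129,
    677_956_948_289,
    550_889_319_423,
    447_001_611_247,
    453_700_983_270,
    540_208_590_115,
    458_817_000_243,
    649_710_293_818,
    471_183_250_318,
    406_115_459_739,
    342_840_308_580,
    310_507_626_798,
    362_264_384_002,
    269_988_897_698,
    196_713_799_085,
]

GIB = 2 ** 30  # the "GB" figures in the thread are bytes / 2**30

total_bytes = sum(PIECE_BYTES)
total_gib = total_bytes / GIB

# Days from the 20060816 dump to the 20110115 dump.
DAYS = 229 + 276 + 758 + 350

# Average growth per day between the first and last dumps.
raw_first_gib = 782_741_875_000 / GIB
raw_rate = (total_gib - raw_first_gib) / DAYS
sevenz_rate = (38.0 - 5.08) / DAYS  # 7z sizes in GB from the thread
ratio = raw_rate / sevenz_rate

print(f"total: {total_bytes} bytes = {total_gib:.2f} GB")
print(f"raw XML growth: {raw_rate:.3f} GB/day")
print(f"7z growth:      {sevenz_rate:.4f} GB/day")
print(f"raw XML grows about {ratio:.0f}x faster than the 7z files")
```

The per-piece sum matches the 7218617857754 total quoted for the 20110115 dump, and in absolute terms it is the raw XML that grows roughly 180 times faster than the 7z archive.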
The individually numbered files change sizes radically because I'm moving around start and end points. You can ignore that.
I am looking at piece 10 however to see why it's smaller: ah. I have a typo in the size for that one, I asked for only 200000 pages to go in it instead of the 240000 I intended :-D And so that's all that went in (minus deleted pages). Nothing's missing though; anything "extra" winds up in the last piece (15). You can look at the stub files to verify that.
FWIW we'll be juggling the number of pages per chunk on a regular basis.
Ariel
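To illustrate why a too-small page count for one piece loses nothing, here is a purely hypothetical Python sketch of splitting a run of pages into consecutive pieces. It is not the actual dump code: split_pages and the toy numbers are invented for the example, and it assumes pieces are filled in page order with the last piece taking whatever remains.

```python
def split_pages(total_pages, pages_per_piece):
    """Split pages 1..total_pages into consecutive (start, end) ranges.

    Each entry in pages_per_piece is the requested size of one piece.
    One extra final piece is appended that takes everything left over.
    """
    ranges = []
    start = 1
    for count in pages_per_piece:
        end = min(start + count - 1, total_pages)
        ranges.append((start, end))
        start = end + 1
    # Final catch-all piece: whatever has not been assigned yet.
    ranges.append((start, total_pages))
    return ranges


# Toy numbers: three sized pieces plus the final catch-all piece.
intended = split_pages(1000, [240, 240, 240])
with_typo = split_pages(1000, [240, 20, 240])  # second size mistyped

print(intended)
print(with_typo)
```

In the toy run the mistyped piece shrinks, the later pieces shift, and the final piece grows, but every page still lands in exactly one piece.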
Hi,
Thanks for the info. While I was at it, I did some more checking of the history dump file sizes and compression ratios (as reported by 7-Zip 9.20):
enwiki-20110115-pages-meta-history1.xml.7z 434.99x compression
enwiki-20110115-pages-meta-history2.xml.7z 289.46x compression
enwiki-20110115-pages-meta-history3.xml.7z 248.72x compression
enwiki-20110115-pages-meta-history4.xml.7z 216.29x compression
enwiki-20110115-pages-meta-history5.xml.7z 198.67x compression
enwiki-20110115-pages-meta-history6.xml.7z 176.94x compression
enwiki-20110115-pages-meta-history7.xml.7z 161.42x compression
enwiki-20110115-pages-meta-history8.xml.7z 208.59x compression
enwiki-20110115-pages-meta-history9.xml.7z 126.86x compression
enwiki-20110115-pages-meta-history10.xml.7z 112.10x compression
enwiki-20110115-pages-meta-history11.xml.7z 117.27x compression
enwiki-20110115-pages-meta-history12.xml.7z 118.88x compression
enwiki-20110115-pages-meta-history13.xml.7z 133.07x compression
enwiki-20110115-pages-meta-history14.xml.7z 107.10x compression
enwiki-20110115-pages-meta-history15.xml.7z 83.24x compression
pages-meta-history1 has the oldest articles and also the most revisions, so it has the highest compression ratio (most revisions of established articles make only minor changes). The pages-meta-history15 file contains the most recently created articles, which have the fewest revisions but tend to change more relative to the overall article size, and thus it has the lowest 7z compression ratio.
enwiki-20110115-pages-meta-history8.xml doesn't follow the pattern of decreasing compression ratios.
That's all I can report without actually looking inside these files! :)
cheers, Jamie
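The ratios above can be cross-checked against the uncompressed sizes posted earlier in the thread. The Python sketch below divides each piece's byte count by its reported ratio to estimate the 7z size, and lists the pieces whose ratio is higher than the previous piece's. It assumes the 7-Zip ratio is uncompressed size divided by compressed size, and the variable names are illustrative.

```python
# Uncompressed bytes per piece (20110115 dump), from earlier in the thread
PIECE_BYTES = [
    1_080_719_385_129, 677_956_948_289, 550_889_319_423,
    447_001_611_247, 453_700_983_270, 540_208_590_115,
    458_817_000_243, 649_710_293_818, 471_183_250_318,
    406_115_459_739, 342_840_308_580, 310_507_626_798,
    362_264_384_002, 269_988_897_698, 196_713_799_085,
]

# Compression ratios for pieces 1..15 as reported by 7-Zip 9.20
RATIOS = [
    434.99, 289.46, 248.72, 216.29, 198.67,
    176.94, 161.42, 208.59, 126.86, 112.10,
    117.27, 118.88, 133.07, 107.10, 83.24,
]

GIB = 2 ** 30

# Assumption: ratio = uncompressed size / compressed size.
est_7z_bytes = [size / ratio for size, ratio in zip(PIECE_BYTES, RATIOS)]
total_7z_gib = sum(est_7z_bytes) / GIB

# Piece numbers (1-based) whose ratio is higher than the piece before.
rising = [
    i + 1
    for i in range(1, len(RATIOS))
    if RATIOS[i] > RATIOS[i - 1]
]

for number, size in enumerate(est_7z_bytes, start=1):
    print(f"piece {number:2d}: ~{size / GIB:.2f} GB as 7z")
print(f"estimated 7z total: {total_7z_gib:.2f} GB")
print(f"pieces that break the decreasing pattern: {rising}")
```

The estimated total comes out close to the 38.0GB quoted for the 20110115 7z files. On these numbers piece 8 is the only large jump; pieces 11 to 13 also rise, but more modestly.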
----- Original Message -----
From: "Ariel T. Glenn" ariel@wikimedia.org
Date: Tuesday, March 29, 2011 11:43 pm
Subject: Re: [Xmldatadumps-l] March 17 en wikipedia history bz2 files ready
To: Jamie Morken jmorken@shaw.ca
Cc: xmldatadumps-l@lists.wikimedia.org, wikitech-l@lists.wikimedia.org