Hi,
Thanks for the info. While I was at it, I did some more checking of the history dump file sizes and compression ratios (as reported by 7-Zip 9.20):
enwiki-20110115-pages-meta-history1.xml.7z   434.99x compression
enwiki-20110115-pages-meta-history2.xml.7z   289.46x compression
enwiki-20110115-pages-meta-history3.xml.7z   248.72x compression
enwiki-20110115-pages-meta-history4.xml.7z   216.29x compression
enwiki-20110115-pages-meta-history5.xml.7z   198.67x compression
enwiki-20110115-pages-meta-history6.xml.7z   176.94x compression
enwiki-20110115-pages-meta-history7.xml.7z   161.42x compression
enwiki-20110115-pages-meta-history8.xml.7z   208.59x compression
enwiki-20110115-pages-meta-history9.xml.7z   126.86x compression
enwiki-20110115-pages-meta-history10.xml.7z  112.10x compression
enwiki-20110115-pages-meta-history11.xml.7z  117.27x compression
enwiki-20110115-pages-meta-history12.xml.7z  118.88x compression
enwiki-20110115-pages-meta-history13.xml.7z  133.07x compression
enwiki-20110115-pages-meta-history14.xml.7z  107.10x compression
enwiki-20110115-pages-meta-history15.xml.7z   83.24x compression
pages-meta-history1 has the oldest articles and also the most revisions, so it has the highest compression ratio (most revisions to established articles make only minor changes). The pages-meta-history15 file contains the most recently created articles, which have the fewest revisions but tend to have larger changes relative to the overall article size, and so it has the lowest 7z compression ratio.
enwiki-20110115-pages-meta-history8.xml doesn't follow the pattern of decreasing compression ratios.
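For anyone who wants to reproduce that check, here is a minimal Python sketch that flags every file whose ratio is higher than the one before it. The ratios are copied from the list above; the function name `find_rises` is only an illustrative choice.

```python
# Compression ratios for pages-meta-history1..15 of the 20110115 dump,
# as reported by 7-Zip 9.20 (copied from the list earlier in this message).
ratios = [
    434.99, 289.46, 248.72, 216.29, 198.67,
    176.94, 161.42, 208.59, 126.86, 112.10,
    117.27, 118.88, 133.07, 107.10, 83.24,
]


def find_rises(values):
    """Return 1-based file numbers whose ratio exceeds the previous file's."""
    return [i + 1 for i in range(1, len(values)) if values[i] > values[i - 1]]


rises = find_rises(ratios)
for n in rises:
    # ratios is 0-based, file numbers are 1-based.
    print(f"history{n}: {ratios[n - 2]:.2f}x -> {ratios[n - 1]:.2f}x")
```

Applied strictly, this flags files 8, 11, 12 and 13. File 8 is by far the largest jump (161.42x to 208.59x); the other three are much smaller rises.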
That's all I can report without actually looking inside these files! :)
cheers, Jamie
----- Original Message -----
From: "Ariel T. Glenn" ariel@wikimedia.org
Date: Tuesday, March 29, 2011 11:43 pm
Subject: Re: [Xmldatadumps-l] March 17 en wikipedia history bz2 files ready
To: Jamie Morken jmorken@shaw.ca
Cc: xmldatadumps-l@lists.wikimedia.org, wikitech-l@lists.wikimedia.org
The individually numbered files change sizes radically because I'm moving around start and end points. You can ignore that.
I am looking at piece 10 however to see why it's smaller: ah. I have a typo in the size for that one, I asked for only 200000 pages to go in it instead of the 240000 I intended :-D And so that's all that went in (minus deleted pages). Nothing's missing though; anything "extra" winds up in the last piece (15). You can look at the stub files to verify that.
FWIW we'll be juggling the number of pages per chunk on a regular basis.
Ariel
On 29-03-2011, Tue, at 17:08 -0700, Jamie Morken wrote:
Hi all,
Congrats Ariel! :) The sum of the pages-meta-history files for the last two enwiki dumps is 342.7GB for the 20110115 dump and 353.5GB for the 20110317 dump, which shows that the overall dump size grew over the 2 months. Seven of the individually numbered pages-meta-history files shrank while eight grew from 20110115 to 20110317. By far the biggest decrease was the pages-meta-history10.xml.bz2 file, which dropped from 18.7GB down to 1.9GB. I think there are probably missing revisions in that page ID range.
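As a quick sanity check on those figures, here is a small Python sketch of the arithmetic. The sizes are the ones quoted above; the variable names are just illustrative.

```python
from datetime import date

# Total size of the pages-meta-history bz2 files, in GB (figures quoted above).
total_jan_gb = 342.7  # 20110115 dump
total_mar_gb = 353.5  # 20110317 dump

growth_gb = total_mar_gb - total_jan_gb
growth_pct = 100 * growth_gb / total_jan_gb
days_between = (date(2011, 3, 17) - date(2011, 1, 15)).days

# pages-meta-history10.xml.bz2 on its own, in GB.
h10_jan_gb, h10_mar_gb = 18.7, 1.9
h10_drop_pct = 100 * (h10_jan_gb - h10_mar_gb) / h10_jan_gb

print(f"overall: +{growth_gb:.1f}GB ({growth_pct:.2f}%) in {days_between} days")
print(f"history10: -{h10_drop_pct:.1f}%")
```

That works out to about 10.8GB (roughly 3%) of overall growth in 61 days, while history10 alone shrank by about 90%.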
Here are some historical dump sizes for comparison to show the growth of these files:
enwiki-20060816-pages-meta-history.xml.7z 5.08GB
enwiki-20070402-pages-meta-history.xml.7z 11.3GB (229 days since previous dump)
enwiki-20080103-pages-meta-history.xml.7z 17.2GB (276 days since previous dump)
enwiki-20100130-pages-meta-history.xml.7z 31.8GB (758 days since previous dump)
enwiki-20110115-pages-meta-history[1-15].xml.7z 38.0GB (350 days since previous dump)
enwiki-20110115-pages-meta-history[1-15].xml.7z (7z compression in progress)
Here's a graph of this data showing that the dump file size growth seems to be pretty linear (the chart's x-axis starts from the 20060816 dump and ends at the 20110115 dump):
"http://nekrom.com/wikipedia/enwiki%20history%20dump%20file%20size%20over%20time.png"
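The same check can be done numerically instead of by eye. Below is a minimal Python sketch that fits a straight line through the five 7z sizes listed above with ordinary least squares. The helper `linear_fit` is hand-rolled for illustration and is not how the chart was produced.

```python
from datetime import date

# (dump date, total .7z size in GB), copied from the list above.
dumps = [
    (date(2006, 8, 16), 5.08),
    (date(2007, 4, 2), 11.3),
    (date(2008, 1, 3), 17.2),
    (date(2010, 1, 30), 31.8),
    (date(2011, 1, 15), 38.0),
]


def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Days elapsed since the first dump, and the gaps between consecutive dumps.
days = [(d - dumps[0][0]).days for d, _ in dumps]
sizes = [s for _, s in dumps]
gaps = [b - a for a, b in zip(days, days[1:])]

slope, intercept, r_squared = linear_fit(days, sizes)
print(f"gaps between dumps (days): {gaps}")
print(f"growth: {slope:.4f} GB/day, R^2 = {r_squared:.4f}")
```

On these five points the fit comes out at roughly 0.02GB per day with R^2 above 0.99, which is consistent with the "pretty linear" reading, though five points is not much to go on.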
cheers, Jamie
----- Original Message -----
From: "Ariel T. Glenn" ariel@wikimedia.org
Date: Tuesday, March 29, 2011 3:24 pm
Subject: [Xmldatadumps-l] March 17 en wikipedia history bz2 files ready
To: xmldatadumps-l@lists.wikimedia.org
Cc: wikitech-l@lists.wikimedia.org
Well, that used up all my good luck for the year, but the bz2s are ready for download. The md5sums are still calculating; give them a couple hours to show up. If all continues to go well we'll have the 7z files in 4-5 days.
As before, I do not plan to provide a single 350GB file of the bz2, nor a single 7z file for download.
Happy trails,
Ariel
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l