Hi,
Thanks for the info. While I was at it, I did some more checking of the history dump file sizes and compression ratios (as reported by 7-Zip 9.20):
enwiki-20110115-pages-meta-history1.xml.7z 434.99x compression
enwiki-20110115-pages-meta-history2.xml.7z 289.46x compression
enwiki-20110115-pages-meta-history3.xml.7z 248.72x compression
enwiki-20110115-pages-meta-history4.xml.7z 216.29x compression
enwiki-20110115-pages-meta-history5.xml.7z 198.67x compression
enwiki-20110115-pages-meta-history6.xml.7z 176.94x compression
enwiki-20110115-pages-meta-history7.xml.7z 161.42x compression
enwiki-20110115-pages-meta-history8.xml.7z 208.59x compression
enwiki-20110115-pages-meta-history9.xml.7z 126.86x compression
enwiki-20110115-pages-meta-history10.xml.7z 112.10x compression
enwiki-20110115-pages-meta-history11.xml.7z 117.27x compression
enwiki-20110115-pages-meta-history12.xml.7z 118.88x compression
enwiki-20110115-pages-meta-history13.xml.7z 133.07x compression
enwiki-20110115-pages-meta-history14.xml.7z 107.10x compression
enwiki-20110115-pages-meta-history15.xml.7z 83.24x compression
pages-meta-history1 has the oldest articles and also the most revisions, so it has the
highest compression ratio (most revisions of established articles make only minor changes).
The pages-meta-history15 file contains the most recently created articles, which have the fewest revisions
but tend to have larger changes relative to the overall article size, so it has the lowest 7z compression ratio.
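enwiki-20110115-pages-meta-history8.xml (208.59x) is the clearest break in the pattern of decreasing compression ratios: it compresses better than history7 (161.42x). Pieces 11 through 13 also rise slightly after piece 10.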
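
To double-check the pattern, here is a minimal Python sketch that flags every piece whose ratio is higher than the previous piece's. The ratios are copied from the list above; the function names (rising_pieces, largest_rise) are just made up for this example.

```python
# Compression ratios reported by 7-Zip 9.20 for pages-meta-history1..15,
# copied from the list above (index 0 is piece 1).
ratios = [434.99, 289.46, 248.72, 216.29, 198.67, 176.94, 161.42, 208.59,
          126.86, 112.10, 117.27, 118.88, 133.07, 107.10, 83.24]


def rising_pieces(values):
    """Return 1-based piece numbers whose ratio exceeds the previous piece's."""
    return [i + 1 for i in range(1, len(values)) if values[i] > values[i - 1]]


def largest_rise(values):
    """Return (piece number, increase) for the biggest jump over the previous piece."""
    best = max(range(1, len(values)), key=lambda i: values[i] - values[i - 1])
    return best + 1, round(values[best] - values[best - 1], 2)


print(rising_pieces(ratios))  # [8, 11, 12, 13]
print(largest_rise(ratios))   # (8, 47.17)
```

So piece 8 is the big outlier, while the rises at pieces 11 to 13 are much smaller.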
That's all I can report without actually looking inside these files! :)
cheers,
Jamie
----- Original Message -----
From: "Ariel T. Glenn" <ariel@wikimedia.org>
Date: Tuesday, March 29, 2011 11:43 pm
Subject: Re: [Xmldatadumps-l] March 17 en wikipedia history bz2 files ready
To: Jamie Morken <jmorken@shaw.ca>
Cc: xmldatadumps-l@lists.wikimedia.org, wikitech-l@lists.wikimedia.org
> The individually numbered files change sizes radically because I'm
> moving around start and end points. You can ignore that.
>
> I am looking at piece 10 however to see why it's smaller: ah. I have a
> typo in the size for that one, I asked for only 200000 pages to go in it
> instead of the 240000 I intended :-D And so that's all that went in
> (minus deleted pages). Nothing's missing though; anything "extra"
> winds up in the last piece (15). You can look at the stub files to
> verify that.
>
> FWIW we'll be juggling the number of pages per chunk on a regular basis.
>
> Ariel
>
> On 29-03-2011, Tue, at 17:08 -0700, Jamie Morken wrote:
> > Hi all,
> >
> > Congrats Ariel! :) The sum of the pages-meta-history file sizes for the last
> > two enwiki dumps is 342.7GB for the 20110115 dump and 353.5GB for the
> > 20110317 dump, which shows that the overall dump size grew by 10.8GB over 2
> > months. Seven of the individually numbered pages-meta-history files
> > reduced in size while eight increased in size from 20110115 to
> > 20110317. By far the biggest decrease was the
> > pages-meta-history10.xml.bz2 file, which dropped from 18.7GB down to
> > 1.9GB. I think there are probably missing revisions in that page ID
> > range.
> >
> > Here are some historical dump sizes for comparison, to show the growth
> > of these files:
> >
> > enwiki-20060816-pages-meta-history.xml.7z 5.08GB
> > enwiki-20070402-pages-meta-history.xml.7z 11.3GB (229 days since previous dump)
> > enwiki-20080103-pages-meta-history.xml.7z 17.2GB (276 days since previous dump)
> > enwiki-20100130-pages-meta-history.xml.7z 31.8GB (758 days since previous dump)
> > enwiki-20110115-pages-meta-history[1-15].xml.7z 38.0GB (350 days since previous dump)
> > enwiki-20110317-pages-meta-history[1-15].xml.7z (7z compression in progress)
> >
> > Here's a graph of this data, which shows that the dump file size growth seems to
> > be pretty linear:
> > (chart x-axis starts from 20060816 dump and ends at 20110115 dump)
> > "http://nekrom.com/wikipedia/enwiki%20history%20dump%20file%20size%20over%20time.png"
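
The "days since previous dump" figures and the roughly linear growth in the quoted list can be re-derived with a short Python sketch. The dates and sizes are copied from the quoted list; the helper name growth_table and the choice to express growth in GB per 30 days are assumptions made only for this illustration.

```python
from datetime import date

# (dump date, total .7z size in GB), copied from the quoted list.
dumps = [
    (date(2006, 8, 16), 5.08),
    (date(2007, 4, 2), 11.3),
    (date(2008, 1, 3), 17.2),
    (date(2010, 1, 30), 31.8),
    (date(2011, 1, 15), 38.0),
]


def growth_table(points):
    """For each consecutive pair of dumps, return
    (later dump date, days since previous dump, growth in GB per 30 days)."""
    rows = []
    for (d0, s0), (d1, s1) in zip(points, points[1:]):
        days = (d1 - d0).days
        rows.append((d1.isoformat(), days, round((s1 - s0) / days * 30, 2)))
    return rows


for row in growth_table(dumps):
    print(row)
```

The day counts come out as 229, 276, 758 and 350, matching the list. On these five points the growth rate drifts down from about 0.8GB to about 0.5GB per 30 days, so "pretty linear" holds only approximately.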
> >
> > cheers,
> > Jamie
> >
> >
> > ----- Original Message -----
> > From: "Ariel T. Glenn" <ariel@wikimedia.org>
> > Date: Tuesday, March 29, 2011 3:24 pm
> > Subject: [Xmldatadumps-l] March 17 en wikipedia history bz2 files ready
> > To: xmldatadumps-l@lists.wikimedia.org
> > Cc: wikitech-l@lists.wikimedia.org
> >
> > > Well, that used up all my good luck for the year, but the bz2s are ready
> > > for download. The md5sums are still calculating; give them a couple of
> > > hours to show up. If all continues to go well we'll have the 7z files
> > > in 4-5 days.
> > >
> > > As before I do not plan to provide a single 350GB file of the bz2, nor a
> > > single 7z file for download.
> > >
> > > Happy trails,
> > >
> > > Ariel