Hi,
Thanks for the info. While I was at it, I did some more checking of the history dump file
sizes and compression ratios (as reported by 7-Zip 9.20):
enwiki-20110115-pages-meta-history1.xml.7z 434.99x compression
enwiki-20110115-pages-meta-history2.xml.7z 289.46x compression
enwiki-20110115-pages-meta-history3.xml.7z 248.72x compression
enwiki-20110115-pages-meta-history4.xml.7z 216.29x compression
enwiki-20110115-pages-meta-history5.xml.7z 198.67x compression
enwiki-20110115-pages-meta-history6.xml.7z 176.94x compression
enwiki-20110115-pages-meta-history7.xml.7z 161.42x compression
enwiki-20110115-pages-meta-history8.xml.7z 208.59x compression
enwiki-20110115-pages-meta-history9.xml.7z 126.86x compression
enwiki-20110115-pages-meta-history10.xml.7z 112.10x compression
enwiki-20110115-pages-meta-history11.xml.7z 117.27x compression
enwiki-20110115-pages-meta-history12.xml.7z 118.88x compression
enwiki-20110115-pages-meta-history13.xml.7z 133.07x compression
enwiki-20110115-pages-meta-history14.xml.7z 107.10x compression
enwiki-20110115-pages-meta-history15.xml.7z 83.24x compression
pages-meta-history1 has the oldest articles and also the most revisions, so it has the
highest compression ratio (for established articles, most revisions make only minor
changes).
The pages-meta-history15 file contains the most recently created articles, which have the
fewest revisions but tend to have larger changes relative to the overall article size,
and thus it has the lowest 7z compression ratio.
enwiki-20110115-pages-meta-history8.xml doesn't follow the pattern of decreasing
compression ratios.
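The pattern can be checked mechanically against the ratios listed above. Below is a minimal Python sketch, using only those numbers; the function name `pieces_breaking_trend` is made up for illustration and nothing here reads the dump files themselves.

```python
# Compression ratios copied from the list above
# (pieces 1..15 of the enwiki-20110115 pages-meta-history dump).
ratios = [434.99, 289.46, 248.72, 216.29, 198.67, 176.94, 161.42, 208.59,
          126.86, 112.10, 117.27, 118.88, 133.07, 107.10, 83.24]


def pieces_breaking_trend(ratios):
    """Return the 1-based piece numbers whose ratio is higher than
    the ratio of the piece immediately before them."""
    return [i + 1 for i in range(1, len(ratios)) if ratios[i] > ratios[i - 1]]


if __name__ == "__main__":
    print(pieces_breaking_trend(ratios))  # [8, 11, 12, 13]
```

On these numbers piece 8 is the big outlier (about 47x higher than piece 7), while pieces 11 to 13 show smaller upticks over their predecessors.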
That's all I can report without actually looking inside these files! :)
cheers,
Jamie
----- Original Message -----
From: "Ariel T. Glenn" <ariel(a)wikimedia.org>
Date: Tuesday, March 29, 2011 11:43 pm
Subject: Re: [Xmldatadumps-l] March 17 en wikipedia history bz2 files ready
To: Jamie Morken <jmorken(a)shaw.ca>
Cc: xmldatadumps-l(a)lists.wikimedia.org, wikitech-l(a)lists.wikimedia.org
The individually numbered files change sizes radically because I'm moving around
start and end points. You can ignore that.
I am looking at piece 10 however to see why it's smaller: ah. I have a typo in the
size for that one, I asked for only 200000 pages to go in it instead of the 240000 I
intended :-D And so that's all that went in (minus deleted pages). Nothing's missing
though; anything "extra" winds up in the last piece (15). You can look at the stub
files to verify that.
FWIW we'll be juggling the number of pages per chunk on a regular basis.
Ariel
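To illustrate why a too-small count for one piece loses nothing, here is a rough Python sketch of the idea described above. It is only an assumption about the general scheme (consecutive page-ID ranges, with the last piece taking whatever is left), not the actual dump tooling; `assign_page_ranges` and the toy numbers are hypothetical.

```python
def assign_page_ranges(requested, last_page_id):
    """Split page IDs 1..last_page_id into len(requested) + 1 consecutive pieces.

    requested[i] is how many page IDs piece i + 1 is asked to cover.
    The final piece is not given a count: it takes everything left over.
    (Deleted pages would simply be absent from their range.)
    """
    ranges = []
    start = 1
    for count in requested:
        end = start + count - 1
        ranges.append((start, end))
        start = end + 1
    if start > last_page_id:
        raise ValueError("requested counts already cover every page ID")
    ranges.append((start, last_page_id))
    return ranges


if __name__ == "__main__":
    # Toy numbers only: 1000 page IDs split into three pieces.
    print(assign_page_ranges([240, 240], 1000))  # intended counts
    print(assign_page_ranges([200, 240], 1000))  # first count mistyped as 200
```

With the mistyped count the first piece is smaller, the later ranges shift, and the last piece grows by the difference, so every page ID still lands in exactly one piece.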
On 29-03-2011, Tue, at 17:08 -0700, Jamie Morken wrote:
Hi all,
Congrats Ariel! :) The pages-meta-history files for the last two enwiki dumps sum to
342.7GB for the 20110115 dump and 353.5GB for the 20110317 dump, which shows that the
overall dump size grew over 2 months. Seven of the individually numbered
pages-meta-history files reduced in size while eight increased in size from 20110115 to
20110317. By far the biggest decrease was the pages-meta-history10.xml.bz2 file, which
dropped from 18.7GB down to 1.9GB. I think there are probably missing revisions in that
page ID range.
Here are some historical dump sizes for comparison to show the growth of these files:
enwiki-20060816-pages-meta-history.xml.7z 5.08GB
enwiki-20070402-pages-meta-history.xml.7z 11.3GB (229 days since previous dump)
enwiki-20080103-pages-meta-history.xml.7z 17.2GB (276 days since previous dump)
enwiki-20100130-pages-meta-history.xml.7z 31.8GB (758 days since previous dump)
enwiki-20110115-pages-meta-history[1-15].xml.7z 38.0GB (350 days since previous dump)
enwiki-20110115-pages-meta-history[1-15].xml.7z (7z compression in progress)
Here's a graph of this data showing the dump file size growth seems to be pretty linear
(chart x-axis starts from 20060816 dump and ends at 20110115 dump):
"http://nekrom.com/wikipedia/enwiki%20history%20dump%20file%20size%20over%20time.png"
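The "pretty linear" impression can be checked numerically from the five 7z sizes listed above. The following is a small Python sketch under that assumption: it derives the day offsets from the dates in the file names and fits an ordinary least-squares line. The helper names are made up for illustration, and the fit is only as good as five data points allow.

```python
from datetime import date

# (dump date, total 7z size in GB), taken from the list above.
dumps = [
    (date(2006, 8, 16), 5.08),
    (date(2007, 4, 2), 11.3),
    (date(2008, 1, 3), 17.2),
    (date(2010, 1, 30), 31.8),
    (date(2011, 1, 15), 38.0),
]


def days_since_first(dumps):
    """Day offset of each dump relative to the earliest one."""
    first = dumps[0][0]
    return [(d - first).days for d, _ in dumps]


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    xs = days_since_first(dumps)
    ys = [size for _, size in dumps]
    slope, intercept = fit_line(xs, ys)
    print("days since first dump:", xs)
    print(f"growth: {slope:.4f} GB/day (about {slope * 365:.1f} GB/year)")
```

The gaps between consecutive offsets match the "days since previous dump" figures quoted above, and the fitted slope comes out at roughly 0.02 GB per day for the 7z totals.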
cheers,
Jamie
----- Original Message -----
From: "Ariel T. Glenn" <ariel(a)wikimedia.org>
Date: Tuesday, March 29, 2011 3:24 pm
Subject: [Xmldatadumps-l] March 17 en wikipedia history bz2 files ready
To: xmldatadumps-l(a)lists.wikimedia.org
Cc: wikitech-l(a)lists.wikimedia.org
> Well, that used up all my good luck for the year, but the bz2s are ready
> for download. The md5sums are still calculating, give them a couple
> hours to show up. If all continues to go well we'll have the 7z files
> in 4-5 days.
>
> As before I do not plan to provide a single 350gb file of the bz2, nor a
> single 7z file for download.
>
> Happy trails,
>
> Ariel