Jamie Morken wrote:
Hi,
Thanks for the info. While I was at it, I did some more checking of the history dump file sizes and compression ratios (as reported by 7-Zip 9.20):
enwiki-20110115-pages-meta-history1.xml.7z    434.99x compression
enwiki-20110115-pages-meta-history2.xml.7z    289.46x compression
enwiki-20110115-pages-meta-history3.xml.7z    248.72x compression
enwiki-20110115-pages-meta-history4.xml.7z    216.29x compression
enwiki-20110115-pages-meta-history5.xml.7z    198.67x compression
enwiki-20110115-pages-meta-history6.xml.7z    176.94x compression
enwiki-20110115-pages-meta-history7.xml.7z    161.42x compression
enwiki-20110115-pages-meta-history8.xml.7z    208.59x compression
enwiki-20110115-pages-meta-history9.xml.7z    126.86x compression
enwiki-20110115-pages-meta-history10.xml.7z   112.10x compression
enwiki-20110115-pages-meta-history11.xml.7z   117.27x compression
enwiki-20110115-pages-meta-history12.xml.7z   118.88x compression
enwiki-20110115-pages-meta-history13.xml.7z   133.07x compression
enwiki-20110115-pages-meta-history14.xml.7z   107.10x compression
enwiki-20110115-pages-meta-history15.xml.7z    83.24x compression
pages-meta-history1 has the oldest articles and also the most revisions, so it has the highest compression ratio (for established articles, most revisions make only minor changes). The pages-meta-history15 file contains the most recently created articles, which have the fewest revisions but tend to change more relative to the overall article size, and so it has the lowest 7z compression ratio.
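To illustrate the reasoning above, here is a rough sketch on toy data, not on the actual dumps. It builds two synthetic "histories" and compresses them with Python's standard lzma module (the same LZMA family that 7-Zip uses, though not the .7z container). The function names, vocabulary, and revision counts are all made up for the example.

```python
import lzma
import random


def make_history(n_revisions, words_changed, seed=0, article_words=1000):
    """Build a toy revision history: each revision is the full article text
    after replacing `words_changed` randomly chosen words."""
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(5000)]  # hypothetical vocabulary
    article = [rng.choice(vocab) for _ in range(article_words)]
    revisions = []
    for _ in range(n_revisions):
        for _ in range(words_changed):
            article[rng.randrange(article_words)] = rng.choice(vocab)
        revisions.append(" ".join(article))
    return "\n".join(revisions).encode("utf-8")


def ratio(data):
    """Uncompressed size divided by LZMA-compressed size."""
    return len(data) / len(lzma.compress(data))


# "Old article": many revisions, each with a tiny edit.
old_style = make_history(n_revisions=100, words_changed=5)
# "New article": few revisions, each changing a large share of the text.
new_style = make_history(n_revisions=10, words_changed=200)

print(f"many small revisions: {ratio(old_style):.1f}x")
print(f"few large revisions:  {ratio(new_style):.1f}x")
```

The exact numbers depend on the toy parameters, but the history with many near-identical revisions should compress far better, which is the same direction as the trend in the list above.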
enwiki-20110115-pages-meta-history8.xml doesn't follow the pattern of decreasing compression ratios: at 208.59x it compresses better than history7 (161.42x).
Maybe it contains many bot-created articles?
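For anyone who wants to check the pattern mechanically, here is a small sketch that scans the ratios listed above for every chunk that compresses better than the one before it. The ratios are copied from the list; the helper name is just for illustration.

```python
# Compression ratios for pages-meta-history1..15, as reported above.
ratios = [434.99, 289.46, 248.72, 216.29, 198.67, 176.94, 161.42, 208.59,
          126.86, 112.10, 117.27, 118.88, 133.07, 107.10, 83.24]


def find_upticks(values):
    """Return (chunk_number, previous_ratio, ratio) for each chunk whose
    ratio is higher than the preceding chunk's. Chunks are numbered from 1."""
    return [(i + 1, values[i - 1], values[i])
            for i in range(1, len(values))
            if values[i] > values[i - 1]]


for chunk, prev, cur in find_upticks(ratios):
    print(f"history{chunk}: {cur:.2f}x (previous chunk: {prev:.2f}x)")
```

This flags history8 as by far the largest jump, and also shows smaller upticks at history11, history12 and history13.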
That's all I can report without actually looking inside these files! :)
cheers, Jamie