From a CNET interview with Brion:
http://news.cnet.com/8301-17939_109-10103177-2.html
"The text alone is less than 500 MB compressed."
That statement struck me, as I wouldn't have thought that the big wikis could fit in that, much less all of them.
So I went and spent some CPU on calculations:
I first looked at dewiki:
$ 7z e -so dewiki-20081011-pages-meta-history.xml.7z | sed -rn 's/\s*<text xml:space="preserve">([^<]*)(<\/text>)?/\1/gp' | bzip2 -9 | wc -c
325915907 bytes = 310.8 MB
Not bad for a 5.1 GB 7z file. :)
Then I turned to enwiki, beginning with the current versions:
$ bzcat enwiki-20081008-pages-meta-current.xml.bz2 | sed -rn 's/\s*<text xml:space="preserve">([^<]*)(<\/text>)?/\1/gp' | bzip2 -9 | wc -c
253648578 bytes = 241.898 MB
Again, a gigantic file (7.8 GB bz2) was reduced to less than 500 MB. Maybe it *can* be done after all. The history dumps hold many more revisions, but their compression ratio is also greater.
So I had to turn to the beast, the enwiki history files. As there hasn't been a successful enwiki history dump in the last few months, I used an old dump I had, which is nearly a year old and fills 18 GB.
$ 7z e -so enwiki-20080103-pages-meta-history.xml.7z | sed -rn 's/\s*<text xml:space="preserve">([^<]*)(<\/text>)?/\1/gp' | bzip2 -9 | wc -c
1092104465 bytes = 1041.5 MB = 1.02 GB
So, where did that 'less than 500 MB' figure come from? Also note that I used bzip2 instead of gzip, so external storage will be using much more space (plus indexes, ids...).
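For a rough idea of that gap, the same extraction could be rerun with gzip in place of bzip2 (just a sketch over the same dewiki dump as above; I haven't actually run this variant, so I have no number to report):
$ 7z e -so dewiki-20081011-pages-meta-history.xml.7z | sed -rn 's/\s*<text xml:space="preserve">([^<]*)(<\/text>)?/\1/gp' | gzip -9 | wc -c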
Nonetheless, it is impressive how much the size of *already compressed* files shrinks just by stripping out the metadata.
For comparison, dewiki-20081011-stub-meta-history.xml.gz, which contains the remaining metadata, is 1.7 GB. 1.7 GB + 310.8 MB ≈ 2 GB, still much less than the 51.4 GB of dewiki-20081011-pages-meta-history.xml.bz2!
Maybe we should investigate new ways of storing the dumps compressed. Could we achieve similar gains by increasing the bzip2 window size to counteract the noise of the revision metadata? Or perhaps I used a wrong regex and large chunks of data were not taken into account?
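A quick way to sanity-check the regex (a sketch, untested; the point is just to compare uncompressed byte counts, not to report new figures) would be to see how much of the dump the sed filter actually keeps:
$ 7z e -so dewiki-20081011-pages-meta-history.xml.7z | wc -c   # total uncompressed XML
$ 7z e -so dewiki-20081011-pages-meta-history.xml.7z | sed -rn 's/\s*<text xml:space="preserve">([^<]*)(<\/text>)?/\1/gp' | wc -c   # bytes kept by the filter
If the second number plus the stub metadata doesn't come close to the first, the filter is dropping text and the figures above are too optimistic.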