From a CNET interview with Brion:
http://news.cnet.com/8301-17939_109-10103177-2.html
"The text alone is less than 500 MB compressed."
That statement struck me, as I wouldn't have thought that the big wikis could fit in that, much less all of them.
So I went and spent some CPU on calculations:
I first looked at dewiki:
$ 7z e -so dewiki-20081011-pages-meta-history.xml.7z | sed -n 's/\s*<text xml:space="preserve">([^<]*)(</text>)?/\1/gp' | bzip2 -9 | wc -c
325915907 bytes = 310.8 MB
Not bad for a 5.1 GB 7z file. :)
Then I turned to enwiki, beginning with the current versions:
$ bzcat enwiki-20081008-pages-meta-current.xml.bz2 | sed -n 's/\s*<text xml:space="preserve">([^<]*)(</text>)?/\1/gp' | bzip2 -9 | wc -c
253648578 bytes = 241.898 MB
Again, a gigantic file (7.8 GB bz2) was reduced to less than 500 MB. Maybe it *can* be done after all. The full history has many more revisions, but its compression ratio is also higher.
So I had to turn to the beast, the enwiki history files. As there hasn't been any successful enwiki history dump in the last months, I used an old dump I had, which is nearly a year old and fills 18 GB.
$ 7z e -so enwiki-20080103-pages-meta-history.xml.7z | sed -n 's/\s*<text xml:space="preserve">([^<]*)(</text>)?/\1/gp' | bzip2 -9 | wc -c
1092104465 bytes = 1041.5 MB = 1.01 GB
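That is roughly 1 GB of bare text out of an 18 GB 7z history dump, i.e. about 6% of it.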
So, where did those 'less than 500 MB' numbers come from? Also note that I used bzip2 instead of gzip, so external storage will be using much more space (plus indexes, ids...).
Nonetheless, the results are impressive in showing how much the size of *already compressed* files gets reduced just by stripping out the metadata.
As a comparison, dewiki-20081011-stub-meta-history.xml.gz, which contains the remaining metadata, is 1.7 GB. 1.7 GB + 310.8 MB is still much less than the 51.4 GB of dewiki-20081011-pages-meta-history.xml.bz2!
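(That is, roughly 1.7 GB + 0.3 GB ≈ 2 GB in total, or about 4% of the 51.4 GB bz2.)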
Maybe we should investigate new ways of storing the dumps compressed. Could we achieve similar gains by increasing the bzip2 block size to counteract the noise of the revision metadata? Or perhaps I used a wrong regex, and thus large chunks of data were not taken into account?
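For what it's worth, bzip2 -9 already uses its maximum 900 kB block size, so squeezing out much more that way would probably mean something like 7z/LZMA with a large dictionary instead. As for the regex, an untested, multi-line-aware sketch of the extraction could look like this (it keeps the <text> tags and their indentation, so the byte count it reports would be slightly pessimistic):

$ 7z e -so dewiki-20081011-pages-meta-history.xml.7z | awk '
    /<text [^>]*\/>/              { next }         # skip self-closed (empty) text elements
    /<text xml:space="preserve"/  { in_text = 1 }  # an opening tag starts a text block
    in_text                       { print }        # keep every line inside the block
    /<\/text>/                    { in_text = 0 }  # the closing tag ends the block
  ' | bzip2 -9 | wc -c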
Platonides wrote:
So, where did those 'less than 500 MB' numbers come from?
Off the top of my head, referring to compressed size of text of current article pages only. Looks like enwiki has expanded a bit since I last looked (4.1 GB). :)
-- brion
On Wed, Dec 3, 2008 at 7:43 PM, Platonides <Platonides@gmail.com> wrote: [snip]
Or perhaps I used a wrong regex, and thus large chunks of data were not taken into account?
Yes.