Hi!
There used to be a figure for the total size of all Wikipedia database dumps at download.wikipedia.org. On March 9, 2005 I noted 50 gigabytes compressed (15 for just the current versions). I wonder what the actual numbers are now. I assume the change to XML does not make a big difference because the data is compressed, but I don't remember whether it was compressed with gzip or bzip2. Anyway, a total number would be interesting to give an impression of how much information we collect. Could someone with access to the server please run a simple shell script to get the size you would need if you wanted to download all of Wikipedia's data?
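Something along these lines would probably do; the dump path and the file layout are just guesses on my part, and it's a few lines of Python rather than plain shell, but the idea is the same:

#!/usr/bin/env python
# Rough sketch only: DUMP_ROOT and the assumption that all compressed dumps
# end in .bz2/.gz/.7z are guesses, not the real layout on the download server.
import os

DUMP_ROOT = "/data/dumps"  # hypothetical location of download.wikipedia.org's files

total_bytes = 0
for dirpath, dirnames, filenames in os.walk(DUMP_ROOT):
    for name in filenames:
        if name.endswith(".bz2") or name.endswith(".gz") or name.endswith(".7z"):
            total_bytes += os.path.getsize(os.path.join(dirpath, name))

print("Total size of compressed dumps: %.1f GB" % (total_bytes / 1024.0 ** 3))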
Thanks and greetings! Jakob
P.S.: By the way, 20050713_pages_full.xml.bz2 (29.9G) seems to be the newest full dump of the English Wikipedia, but I bet you don't need a terabyte for all the compressed full dumps - yet. I found the first plans for RAID systems with several terabytes here: http://meta.wikimedia.org/wiki/Wikimedia_servers/hardware_orders/wishlist
On 10/17/05, Jakob Voss <jakob.voss@nichtich.de> wrote:
Hi!
There used to be a figure for the total size of all Wikipedia database dumps at download.wikipedia.org.
I was just pondering this yesterday. Samuel is the master, with 6 × 73 GB = 438 GB. Of course, that's not in dump form.
Jakob, this ties in with the earlier request for wikipedia-by-mail. I was thinking of doing a Fundable.org drive for an array so that I could serve those requests, but perhaps using the Tool server makes more sense...
Jeremy Dunck wrote:
On 10/17/05, Jakob Voss <jakob.voss@nichtich.de> wrote:
Hi!
There used to be a figure for the total size of all Wikipedia database dumps at download.wikipedia.org.
I was just pondering this yesterday. Samuel is the master, with 6 × 73 GB = 438 GB. Of course, that's not in dump form.
Jakob, this ties in with the earlier request for wikipedia-by-mail. I was thinking of doing a Fundable.org drive for an array so that I could serve those requests, but perhaps using the Tool server makes more sense...
Samuel's InnoDB data files are about 290 GB, but it's likely most of that is free space. There's also about 100 GB distributed across our external storage DBs; the hypothesized free space in samuel is because we moved a lot of the text out of it and into external storage. It's all compressed with gzip.
The current total size of all the pages_full.xml.bz2 files from the latest dump is 14 GB. In total, the wikipedia directory on the download server is using 236 GB, thanks mostly to image tarballs and poorly compressed copies of the text.
-- Tim Starling
Tim Starling wrote:
Jakob, this ties in with the earlier request for wikipedia-by-mail. I was thinking of doing a Fundable.org drive for an array so that I could serve those requests, but perhaps using the Tool server makes more sense...
The current total size of all the pages_full.xml.bz2 files from the latest dump is 14 GB. In total, the wikipedia directory on the download server is using 236 GB, thanks mostly to image tarballs and poorly compressed copies of the text.
Thanks! I think you calculated this number with the current, failed en dump of 1 GB; 20050924_pages_full.xml.bz2 is 11.3 GB, so it should be around 25 GB for all pages_full.xml.bz2 files (~5 GB with 7zip), or ~400 GB decompressed. It seems that, thanks to improving compression algorithms, the terabyte disk array can wait until the end of next year. Spending more server time on better compression is better than making people spend more time on downloading.
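For reference, the back-of-the-envelope arithmetic behind those figures (the share of en and the compression ratios are rough guesses on my part, not measured values):

# Rough estimate only; all ratios below are guesses, not measurements.
en_full_bz2_gb = 11.3           # size of 20050924_pages_full.xml.bz2
en_share_of_total = 0.45        # assume en is a bit under half of all wikis combined
bz2_to_7zip_factor = 5.0        # assume 7zip shrinks these dumps about 5x vs. bzip2
bz2_expansion_factor = 16.0     # assume the XML is about 16x larger uncompressed

all_full_bz2_gb = en_full_bz2_gb / en_share_of_total          # ~25 GB
all_full_7z_gb = all_full_bz2_gb / bz2_to_7zip_factor         # ~5 GB
all_full_plain_gb = all_full_bz2_gb * bz2_expansion_factor    # ~400 GB

print("all pages_full.xml.bz2: ~%.0f GB" % all_full_bz2_gb)
print("with 7zip:              ~%.0f GB" % all_full_7z_gb)
print("decompressed:           ~%.0f GB" % all_full_plain_gb)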
Greetings, Jakob
P.S.: You could additionally provide only parts of the dumps, like current, articles, full, titles ... Maybe articles_full (version history of articles only) could be of use, but I don't know.