I'm curious as to the size (in bytes) of the current Wikipedia.
That is, if one took a snapshot of the Wikipedia in web form (including markup, images, multimedia, etc.), how large would it be? If the web documents were compressed, then how large would it be? (This would not include edit history information which I assume is substantial -- only interested in a snapshot of the current pages.)
Referring to:
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons
I see the following statistics:
Total text (no markup): 3 gigabytes (rough estimate)
Photos/illustrations: 726,000
I don't have a good feel for the size of a typical photo/illustration. Assuming the display version (in JPEG) is typically 10 kbytes, the images would total approximately 7 gigabytes.
I also don't have a feel for what the text would compress to (including some sort of markup), but I'd guess that with added markup plus compression, the overall size would remain about the same, or maybe a little smaller.
So it looks like 10 gigabytes is a very rough estimate.
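For anyone who wants to redo the arithmetic with different guesses, here is a minimal sketch of the same back-of-envelope estimate. The 3 GB text figure and the 726,000 image count come from the size comparisons page quoted above; the 10 kbytes per image is only my assumption, and the function name is made up for illustration. It uses decimal gigabytes (10^9 bytes) and assumes, as above, that markup and compression roughly cancel out for the text.

```python
# Back-of-envelope snapshot size, using the figures quoted in this message.
TEXT_BYTES = 3 * 10**9        # "Total text (no markup): 3 gigabytes (rough estimate)"
IMAGE_COUNT = 726_000         # "Photos/illustrations: 726,000"
AVG_IMAGE_BYTES = 10 * 10**3  # ASSUMPTION: ~10 kbytes per display-size JPEG


def estimate_snapshot_bytes(text_bytes, image_count, avg_image_bytes):
    """Return text size plus image count times assumed average image size."""
    return text_bytes + image_count * avg_image_bytes


image_bytes = IMAGE_COUNT * AVG_IMAGE_BYTES
total = estimate_snapshot_bytes(TEXT_BYTES, IMAGE_COUNT, AVG_IMAGE_BYTES)

print(f"images: {image_bytes / 1e9:.2f} GB")  # images: 7.26 GB
print(f"total:  {total / 1e9:.2f} GB")        # total:  10.26 GB
```

The result is very sensitive to the per-image guess: at 50 kbytes per image, the images alone would come to about 36 GB.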
Am I anywhere close? Or am I forgetting something?
Thanks.
Jon Noring
On 1/12/07, Jon Noring jon@noring.name wrote:
So it looks like 10 gigabytes is a very rough estimate.
Am I anywhere close? Or am I forgetting something?
Other important factors might include whether we'd be including page histories and logs in the size estimate -- page history, of course, includes every version of a page ever produced, so I'd imagine the amount of available text would see a pretty big increase. Deleted edits (or even media) would likewise make a pretty decent jump, I bet, but the case for including those is a bit more slim.
But overall, the specifics on this subject are pretty far from the sort of things I really know about.
-Luna
On 1/13/07, Jon Noring jon@noring.name wrote:
I'm curious as to the size (in bytes) of the current Wikipedia.
That is, if one took a snapshot of the Wikipedia in web form (including markup, images, multimedia, etc.), how large would it be? If the web documents were compressed, then how large would it be? (This would not include edit history information which I assume is substantial -- only interested in a snapshot of the current pages.)
You can go to http://download.wikipedia.org/ and look at the static downloads. The current HTML download for the English Wikipedia is about 5.5 GB, compressed with 7-zip.
In the tables at http://stats.wikimedia.org/EN/TablesWikipediaEN.htm the number of "binaries" (images, audio etc) was 620k last October. I don't see the actual size in bytes anywhere.
Alfio
Talking about a related issue... circulation of Wikipedia content through non-Internet means.
Webaroo [http://www.webaroo.com/] has this set of interesting Wikipedia web packs [http://www.webaroo.com/category/wiki2go_9]. Has someone checked them out? The last time around, unfortunately, they were only Windows-compatible, so quite useless for my Free Software-based GNU/Linux system. I was eager to check one which came with an Indian computer magazine, here in Goa. --FN
On 1/13/07, Frederick Noronha fred@bytesforall.org wrote:
Talking about a related issue... circulation of Wikipedia content through non-Internet means.
Webaroo [http://www.webaroo.com/] has this set of interesting Wikipedia web packs [http://www.webaroo.com/category/wiki2go_9]. Has someone checked them out? The last time around, unfortunately, they were only Windows-compatible, so quite useless for my Free Software-based GNU/Linux system. I was eager to check one which came with an Indian computer magazine, here in Goa. --FN
Actually, on the front page there is something like a whole Wikipedia web pack, at 4.3 GB.
Alfio
wikipedia-l@lists.wikimedia.org