Martina Greiner wrote:
> I would like to download the English version of Wikipedia only, and
> have several questions I could not get answered.
> en.wikipedia.org/wiki/Database_download says that the English version
> should be 11 GB compressed and about 40 GB uncompressed. On
> download.wikimedia.org/wikipedia/en I can only find files totalling
> 31 GB (cur+old).
That figure is quoted from September 2004; if it was accurate then, it's
certainly not accurate now. :) Wikipedia has a very high rate of growth.
> 1. Is this really the English version only, or for all wikipedias?
Yes, files in that directory are for
en.wikipedia.org only.
> 2. If it is the English version, why is it that big?
cur and old combined include the more or less complete edit history of
every page on the site since January 2001. We have a lot of pages, and
some of them have been edited many, many times.
The English Wikipedia is our oldest, largest, and most popular site.
It's larger than all the others we run individually (I'm not sure
offhand whether it's still larger than all others combined.)
> 3. I assume this is a compressed file, thus it will be really, really
> big when uncompressed.
The old database will be only slightly bigger after decompression, since
its contents themselves are currently stored compressed. Normally you
would not store the uncompressed dump, however: you would feed it
directly to the database during decompression.
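As a sketch of that streaming approach (the file name, table contents, and database name below are made up for illustration; the pipeline builds its own tiny "dump" so it is self-contained):

```shell
# Make a tiny compressed "dump" so the pipeline below can actually run.
printf 'INSERT INTO cur VALUES (1);\n' > dump.sql
gzip -f dump.sql                # leaves only dump.sql.gz on disk

# Decompress on the fly and hand the SQL straight to a consumer, so the
# uncompressed dump is never written out. With a real dump the consumer
# would be the database client instead of wc, along the lines of:
#   gzip -dc cur_table.sql.gz | mysql wikidb
gzip -dc dump.sql.gz | wc -c   # prints the uncompressed size in bytes
```

The same idea works with bzip2-compressed dumps via `bzip2 -dc`.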
> 4. Is it enough to download cur and old tables in order to get all
> available tables?
That depends on what you want. The current version of every page as of
the backup date is contained in the cur table. The older revisions are
in the old table.
Additional data is in the links tables, etc.; depending on what you're
doing you may not care about these. (The links tables keep track of
which pages link to which other pages and resources.) These can be
regenerated from the actual pages, but it takes a long time.
Images are also separate.
-- brion vibber (brion @ pobox.com)