I think you're actually referring to the 30 January 2010 dump :-).

Please also beware of some problems with missing revisions in the latest dumps.

Forwarding a message from the Wikimedia-XML-dumps mailing list below. Ariel T. Glenn also reported that this was caused by a bug that was fixed during that run.

In addition, enwiki-20100312.7z is 15.8 GB, whereas enwiki-20100130.7z is 31.9 GB. The former contains only about half of the total number of revisions, so the January 2010 dump is the best one we have so far, despite its revisions with missing text (missing because of backup problems, not vandalism).

From: Dmitri Chichkov

Comparing the two archives (IMO [enwiki-20100312 15.8 GB] doesn't seem to have empty revisions caused by the backup problem), you can estimate that ~0.2% of all revisions are empty due to vandalism etc., and that in the [enwiki-20100130 31.9 GB] file an additional ~0.4% are missing due to backup failures.
[enwiki-20100130 31.9 GB] Revisions 313797035. Empty Revisions 1524837.
[enwiki-20100312 15.8 GB] Revisions 184986173. Empty Revisions 370982.
[enwiki-20100130 31.9 GB] Revisions 185000000. Empty Revisions 1158890. (same position in the archive)
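For anyone who wants to check the percentages, here is a minimal Python sketch of the arithmetic. The counts are copied from the lines above; the variable names are mine, and reading the third line as "the January dump counted over roughly the same stretch of the archive as the March dump" is my assumption.

```python
# Counts copied from the figures quoted above: (revisions, empty revisions).
jan_full = (313797035, 1524837)    # enwiki-20100130, whole archive
mar_full = (184986173, 370982)     # enwiki-20100312, whole archive
jan_prefix = (185000000, 1158890)  # enwiki-20100130, same position in the archive


def empty_rate(revisions, empty):
    """Return the percentage of revisions whose text is empty."""
    return 100.0 * empty / revisions


baseline = empty_rate(*mar_full)    # empty for ordinary reasons (vandalism etc.)
jan_rate = empty_rate(*jan_prefix)  # same span of history in the January dump
excess = jan_rate - baseline        # share attributed to backup failures

print(f"March baseline:      {baseline:.2f}%")
print(f"January, same span:  {jan_rate:.2f}%")
print(f"Excess in January:   {excess:.2f}%")
print(f"January, whole dump: {empty_rate(*jan_full):.2f}%")
```

The baseline and the excess come out close to the ~0.2% and ~0.4% quoted above.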

You can also look at a single article, e.g. the 'Anarchism' article. In [enwiki-20100130 31.9 GB] it has 15180 revisions, 624 of which are empty (4%). In [enwiki-20100312 15.8 GB] it has 15261 revisions, only 8 of which are empty (0.05%).
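In case it helps anyone reproduce per-article counts like these, here is a rough Python sketch that streams a MediaWiki XML export and tallies empty revisions per page. It is not the method used for the numbers above. The function names are hypothetical, "empty" is assumed to mean a missing or whitespace-only <text> element, and the tiny sample document is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_empty_revisions(stream):
    """Return {title: (revisions, empty_revisions)} for an XML export stream."""
    counts = {}
    title = None
    total = empty = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            total += 1
            text = next((c for c in elem if local_name(c.tag) == "text"), None)
            # Assumption: missing or whitespace-only text counts as empty.
            if text is None or not (text.text or "").strip():
                empty += 1
            elem.clear()  # drop the revision text so memory stays bounded
        elif name == "page":
            counts[title] = (total, empty)
            total = empty = 0
            elem.clear()
    return counts


# Made-up sample: one page with three revisions, two of them empty.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Anarchism</title>
    <revision><id>1</id><text>Some text</text></revision>
    <revision><id>2</id><text></text></revision>
    <revision><id>3</id><text /></revision>
  </page>
</mediawiki>"""

result = count_empty_revisions(io.BytesIO(sample))
print(result)
```

For a real dump you would pass the decompressed XML stream instead of the in-memory sample. Tags are matched by local name, so the sketch does not depend on the export schema version.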




--- On Wed, 2 June 2010, Brian J Mingus <Brian.Mingus@Colorado.EDU> wrote:

From: Brian J Mingus <Brian.Mingus@Colorado.EDU>
Subject: Re: [Wiki-research-l] actual size of 30 may 2010 dump
To: aforte@gatech.edu, "Research into Wikimedia content and communities" <wiki-research-l@lists.wikimedia.org>
Date: Wednesday, 2 June 2010, 20:33

On Wed, Jun 2, 2010 at 12:02 PM, Andrea Forte <andrea.forte@gmail.com> wrote:
Hi all, anyone have a close estimate (or exact number) for the size of
the 30 May 2010 enwiki dump once unzipped?

>5TB is what it says here [http://en.wikipedia.org/wiki/Wikipedia_database#Latest_complete_dump_of_English_Wikipedia] but that really does leave a lot of possibilities. :)


I found an e-mail on wikitech-l that reports 5.34158501 terabytes.

-----Inline attachment follows-----

Wiki-research-l mailing list