I was doing a bit of analysis of the dump
enwiki-20100130-pages-meta-history.xml.7z. What I found to my surprise
is that there are (at least) 7 million pages in the main namespace. I
got this figure by grepping for page titles that do not contain a ":"
character. Is this really the case or am I missing something? I'd seen
some Wikimedia stats that said the number of articles currently is about
3.2 million, so I'm not sure why I'm seeing so many pages in the dump.
This went off-list for some reason...
---------- Forwarded message ----------
From: Thomas Dalton <thomas.dalton(a)gmail.com>
Date: 29 June 2010 01:18
Subject: Re: [Xmldatadumps-l] Number of pages on Wikipedia
To: "Chrisil J. Arackaparambil" <chrisil(a)lanl.gov>
On 29 June 2010 01:06, Chrisil J. Arackaparambil <chrisil(a)lanl.gov> wrote:
> Hello everybody,
> I was doing a bit of analysis of the dump
> enwiki-20100130-pages-meta-history.xml.7z. What I found to my surprise
> is that there are (at least) 7 million pages in the main namespace. I
> got this figure by grepping for page titles that do not contain a ":"
> character. Is this really the case or am I missing something? I'd seen
> some Wikimedia stats that said the number of articles currently is about
> 3.2 million, so I'm not sure why I'm seeing so many pages in the dump.
The 3.2 million figure does not include redirects.
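
To illustrate how much of the gap redirects account for, here is a minimal sketch that streams an uncompressed dump and counts main-namespace pages and redirects separately. It is not how the Wikimedia statistics are produced. It assumes that a redirect page either carries a <redirect> child element or has revision text beginning with "#REDIRECT"; the function names and the sample data are hypothetical. It reuses the "no colon in the title" heuristic from the original question, which also drops real articles whose titles contain a colon, so the total is a lower bound.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_main_namespace(stream):
    """Return (total, redirects) for pages whose title contains no ':'.

    Assumption: a page is a redirect if it has a <redirect> child element
    or if the last revision text seen starts with '#REDIRECT'.
    """
    total = 0
    redirects = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = ""
        has_redirect_tag = False
        last_text = ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "redirect":
                has_redirect_tag = True
            elif name == "text":
                last_text = child.text or ""
        if ":" not in title:
            total += 1
            if has_redirect_tag or last_text.lstrip().upper().startswith("#REDIRECT"):
                redirects += 1
        # Drop the finished page's contents so memory use stays bounded
        # by the largest single page rather than the whole dump.
        elem.clear()
    return total, redirects


# Hypothetical three-page sample standing in for a real dump file.
SAMPLE = """<mediawiki>
  <page><title>Apple</title><revision><text>A fruit.</text></revision></page>
  <page><title>Apples</title><redirect /><revision><text>#REDIRECT [[Apple]]</text></revision></page>
  <page><title>Talk:Apple</title><revision><text>Discussion.</text></revision></page>
</mediawiki>"""

total, redirects = count_main_namespace(io.BytesIO(SAMPLE.encode("utf-8")))
print(total, redirects)
```

For a real run, the same function could be given a file object opened on the decompressed XML. In a full-history dump each page holds every revision until its closing tag is reached, so pages with long histories will still use a lot of memory.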
I'd downloaded the enwiki-20100312-pages-meta-history.xml.7z dump a
couple of days ago, but now it seems the link has been taken down:
On browsing the archives of this list I also found some comments
indicating that this dump had some problems.
I'd appreciate it if someone could please tell me what the status of
this dump is, and which would be the most recent dump in good shape.
Also, if the 20100312 dump is bad, perhaps there should be a message on
that page indicating it so that people don't mistakenly take it for a
For all those watching our snapshots, there will be some downtime today as we transition to a new storage node.
We've been testing on the node for quite a while and we're finally ready to migrate over.
Today's schedule includes:
- final syncs between storage2 -> dataset1
- stopping current system
- changing configs
- starting snapshots system
- migrating download.
We'll keep you updated as the transition progresses.
I have restarted the frwiki run again and the dump for plwiki which I
inadvertently interrupted; I apologize for the inconvenience. These are
now both running so that they can be left unattended.
Does anyone here have one of the dumps of test.wikipedia from Jan or Feb
of this year? If so please let me know off list, I'd like to do some
comparisons with the current run. Thanks.
Also, in reply to Jamie Morken, the missing text for the 2005 revisions
in the Jan 2010 en wikipedia dump was caused by a bug that was
fixed midway through that run (see the techblog post for specifics).