Hello everybody,
I was doing a bit of analysis of the dump enwiki-20100130-pages-meta-history.xml.7z. What I found to my surprise is that there are (at least) 7 million pages in the main namespace. I got this figure by grepping for page titles that do not contain a ":" character. Is this really the case or am I missing something? I'd seen some Wikimedia stats that said the number of articles currently is about 3.2 million, so I'm not sure why I'm seeing so many pages in the dump.
Thank you, Chrisil
Chrisil J. Arackaparambil, 29/06/2010 02:06:
> I was doing a bit of analysis of the dump enwiki-20100130-pages-meta-history.xml.7z. What I found to my surprise is that there are (at least) 7 million pages in the main namespace.
http://stats.wikimedia.org/EN/TablesWikipediaEN.htm#namespaces
> I got this figure by grepping for page titles that do not contain a ":" character. Is this really the case or am I missing something? I'd seen some Wikimedia stats that said the number of articles currently is about 3.2 million, so I'm not sure why I'm seeing so many pages in the dump.
http://www.mediawiki.org/wiki/Manual:Article_count
Nemo
On Mon, Jun 28, 2010 at 06:06:07PM -0600, Chrisil J. Arackaparambil wrote:
> enwiki-20100130-pages-meta-history.xml.7z. What I found to my surprise is that there are (at least) 7 million pages in the main namespace. I got this figure by grepping for page titles that do not contain a ":" character. Is this really the case or am I missing something?
Your page count likely includes redirect pages. Normally article counts exclude redirects.
Greg Hewgill http://hewgill.com
Thanks everybody! I just got the figure for the number of redirects as 4.5 million:
~/7zip/p7zip_9.13/bin/7z -so e enwiki-20100130-pages-meta-history.xml.7z 2>/dev/null | perl -ne 'print if m{<redirect />}' | wc -l
4493204
Chrisil
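For anyone wanting both numbers from a single pass, here is a rough Python sketch of the same idea. It is not what was run in this thread: the function names `count_pages` and `local_name` and the tiny `SAMPLE` document are made up for illustration. It assumes each `<page>` carries a `<title>` and, for redirects, a `<redirect />` child, as the perl one-liner above implies. It also reuses the "no colon in the title" heuristic, which only approximates the main namespace because some article titles contain a colon themselves.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def count_pages(stream):
    """Count colon-free titles and how many of those pages are redirects.

    Returns (pages_without_colon, redirects_among_them).
    """
    pages = 0
    redirects = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "revision":
            # Revision text dominates a full-history dump; drop it early.
            elem.clear()
        elif name == "page":
            title = ""
            is_redirect = False
            for child in elem:
                child_name = local_name(child.tag)
                if child_name == "title":
                    title = child.text or ""
                elif child_name == "redirect":
                    is_redirect = True
            # Heuristic from the thread: no ":" means "probably main namespace".
            if ":" not in title:
                pages += 1
                if is_redirect:
                    redirects += 1
            elem.clear()
    return pages, redirects


# Hypothetical three-page sample: one article, one redirect, one talk page.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page><title>Foo</title><id>1</id><revision><id>10</id><text>Body</text></revision></page>
  <page><title>Bar</title><id>2</id><redirect /><revision><id>11</id><text>#REDIRECT [[Foo]]</text></revision></page>
  <page><title>Talk:Foo</title><id>3</id><revision><id>12</id><text>Chat</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    pages, redirects = count_pages(io.BytesIO(SAMPLE))
    print(pages, redirects, pages - redirects)
```

On a real dump, the output of `7z -so e` could be piped in and `sys.stdin.buffer` passed to `count_pages` instead of the sample. Unlike the perl pipeline above, which counts `<redirect />` lines in every namespace, this sketch only counts redirects among the colon-free titles, so the two figures need not match.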
xmldatadumps-l@lists.wikimedia.org