Aryeh Gregor wrote:
On Fri, Mar 13, 2009 at 2:44 PM, O. O. <olson_ot@yahoo.com> wrote:
Hi, I am looking at the dump of the English Wikipedia at http://download.wikimedia.org/enwiki/20081008/ . There is a file called “all-titles-in-ns0.gz” which is supposed to contain the list of page titles. If I do
cat enwiki-20081008-all-titles-in-ns0 | wc -l
I get 5716820. On the same page, a little above in “pages-articles.xml.bz2” we have “enwiki 7649051 pages”.
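For anyone who would rather not decompress the file first, the same line count can be taken directly from the .gz. This is only a minimal sketch: the function name count_lines is made up for illustration, and it assumes the file holds one title per line.

```python
import gzip


def count_lines(path):
    """Count the lines in a gzip-compressed text file without unpacking it to disk."""
    count = 0
    # Binary mode avoids any decoding cost; only line boundaries matter here.
    with gzip.open(path, "rb") as fh:
        for _ in fh:
            count += 1
    return count


if __name__ == "__main__":
    import sys

    # Example: python count_lines.py enwiki-20081008-all-titles-in-ns0.gz
    if len(sys.argv) > 1:
        print(count_lines(sys.argv[1]))
```

With the two figures quoted above, the gap is 7649051 - 5716820 = 1932231 pages that the ns0 title list does not account for.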
The description for pages-articles.xml.bz2 says it contains "Articles, templates, image descriptions, and primary meta-pages." all-titles-in-ns0.gz contains (as the name suggests) only the titles in ns0, i.e., the main namespace, articles. It does not contain templates, image descriptions, or "primary meta-pages" (whatever those are).
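To see how a list of titles splits across namespaces, something like the following rough sketch might help. It assumes a title's namespace can be read off its prefix before the first colon. The prefix set below is a hypothetical, non-exhaustive sample, and the names namespace_of and count_by_namespace are invented for this example; none of this comes from the dump tooling.

```python
from collections import Counter

# Assumed sample of namespace prefixes (not a complete list).
NAMESPACE_PREFIXES = {
    "Template", "Image", "Wikipedia", "Category", "Help", "Portal", "MediaWiki",
}


def namespace_of(title):
    """Return the namespace prefix of a title, or "(main)" if it has none."""
    prefix, sep, _rest = title.partition(":")
    # A colon alone is not enough: "2001: A Space Odyssey" is a main-namespace title.
    if sep and prefix in NAMESPACE_PREFIXES:
        return prefix
    return "(main)"


def count_by_namespace(titles):
    """Tally an iterable of page titles by namespace."""
    return Counter(namespace_of(t) for t in titles)
```

Fed the titles from pages-articles.xml.bz2, the "(main)" bucket is the part one would expect to correspond to all-titles-in-ns0.gz; everything else falls in the other namespaces.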
Thanks Ilmari and Aryeh.
I am not sure what “primary meta-pages” are; however, “templates” and “image descriptions” do have titles. You can check this in the online version of the English Wikipedia.
O. O.