I'm confused.  I downloaded the 2012-12-01 dump files, but when I query for known categories I don't get the results I expect.  For example:

 

SELECT *
FROM `categorylinks`
WHERE `cl_to` = 'Humanities';

 

Yields 9 rows, but:

 

http://en.wikipedia.org/wiki/Category:Humanities

 

lists 26 subcategories and 71 pages.  I'm wondering if maybe I downloaded the wrong files, or if they didn't import completely.  Here are the files and row counts, as reported by phpMyAdmin:

 

1. enwiki-20121201-category.sql.gz - ~1,544,750 rows
2. enwiki-20121201-categorylinks.sql.gz - ~1,380,956 rows
3. enwiki-20121201-page.sql.gz - ~1,492,392 rows
4. enwiki-20121201-page_props.sql.gz - ~5,415,922 rows

 

The MD5 checksums match.  What am I doing wrong?
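
One way to narrow down whether the import is incomplete might be to compare the number of categorylinks rows for a category against the member count stored in the category table (cat_pages, which includes subcategories).  The following is only a minimal sketch under stated assumptions, not a diagnosis: it uses Python's sqlite3 with an in-memory database as a stand-in for the MySQL import, the table definitions are trimmed to the columns needed, the toy rows simply mirror the numbers quoted above (9 rows found, 26 + 71 = 97 expected), and the helper name compare_counts is hypothetical.

```python
import sqlite3

# In-memory SQLite stand-in for the imported MySQL tables (assumption:
# only the columns used here; the real dump tables have more).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        cat_id INTEGER PRIMARY KEY,
        cat_title TEXT,
        cat_pages INTEGER,    -- all members, including subcategories
        cat_subcats INTEGER
    );
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
""")

# Toy rows mirroring the numbers in this message, not real dump data.
conn.execute("INSERT INTO category VALUES (1, 'Humanities', 97, 26)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, 'Humanities')",
    [(page_id,) for page_id in range(1, 10)],
)


def compare_counts(conn, title):
    """Return (count recorded in category, rows present in categorylinks)."""
    row = conn.execute(
        "SELECT cat_pages FROM category WHERE cat_title = ?", (title,)
    ).fetchone()
    recorded = row[0] if row else None
    present = conn.execute(
        "SELECT COUNT(*) FROM categorylinks WHERE cl_to = ?", (title,)
    ).fetchone()[0]
    return recorded, present


recorded, present = compare_counts(conn, "Humanities")
print(f"category says {recorded}, categorylinks has {present}")
```

If the two numbers disagree for many categories, that would suggest the categorylinks import stopped early rather than that the wrong file was downloaded.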

 

Thanks,

 

Robert

 

 

 

-----Original Message-----
From: Ariel T. Glenn [mailto:ariel@wikimedia.org]
Sent: Thursday, January 10, 2013 10:50 AM
To: Robert Crowe
Cc: xmldatadumps-l@lists.wikimedia.org
Subject: RE: [Xmldatadumps-l] Which files do I need?

 

You want the page_props table, and look for entries with the string 'hiddencat' for pp_propname.  (*-page_props.sql.gz)
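
As a rough illustration of how the three tables might fit together, the sketch below lists category members while skipping hidden categories and redirects.  It is a minimal sketch under assumptions, not a tested recipe for the real dumps: it uses Python's sqlite3 with an in-memory database in place of the MySQL import, the tables are trimmed to the columns used, it assumes category pages live in namespace 14 with page_title equal to cl_to, and the toy rows (including the 'Hidden_stuff' category) and the helper name visible_members are made up for the example.

```python
import sqlite3

# In-memory SQLite stand-in for the imported tables (trimmed columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (
        page_id INTEGER PRIMARY KEY,
        page_namespace INTEGER,
        page_title TEXT,
        page_is_redirect INTEGER
    );
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
    CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT);
""")

# Hypothetical toy data: two category pages (namespace 14), three articles.
conn.executemany("INSERT INTO page VALUES (?, ?, ?, ?)", [
    (1, 14, "Humanities", 0),
    (2, 14, "Hidden_stuff", 0),
    (3, 0, "Philosophy", 0),
    (4, 0, "Philosophie", 1),   # a redirect, should be skipped
    (5, 0, "History", 0),
])
conn.executemany("INSERT INTO categorylinks VALUES (?, ?)", [
    (3, "Humanities"), (4, "Humanities"), (5, "Humanities"),
    (3, "Hidden_stuff"),
])
# A 'hiddencat' entry on the category page marks that category as hidden.
conn.execute("INSERT INTO page_props VALUES (2, 'hiddencat', '')")

QUERY = """
    SELECT cl.cl_to, member.page_title
    FROM categorylinks AS cl
    JOIN page AS member ON member.page_id = cl.cl_from
    WHERE member.page_is_redirect = 0
      AND NOT EXISTS (
          SELECT 1
          FROM page AS catpage
          JOIN page_props AS pp ON pp.pp_page = catpage.page_id
          WHERE catpage.page_namespace = 14
            AND catpage.page_title = cl.cl_to
            AND pp.pp_propname = 'hiddencat'
      )
    ORDER BY cl.cl_to, member.page_title
"""


def visible_members(conn):
    """Return (category, member title) pairs for non-hidden categories."""
    return conn.execute(QUERY).fetchall()


for category, title in visible_members(conn):
    print(category, title)
```

The NOT EXISTS subquery is one possible way to express the exclusion; a LEFT JOIN with a NULL check would be an alternative, and which performs better on the full tables would depend on the indexes available.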

 

Ariel

 

On 10-01-2013 (Thu), at 09:58 -0800, Robert Crowe wrote:

> Perfect!  Thanks Ariel.  What is the best way to distinguish hidden categories?  I see that the category table used to have a cat_hidden column, but that's been removed.

>

> Robert

>

 

> -----Original Message-----

> From: Ariel T. Glenn [mailto:ariel@wikimedia.org]

> Sent: Thursday, January 10, 2013 3:34 AM

> To: Robert Crowe

> Cc: xmldatadumps-l@lists.wikimedia.org

> Subject: Re: [Xmldatadumps-l] Which files do I need?

>

> If you are just trying to get at the structure from the various dump
> files, the page table has page ids, titles, and whether the page is a
> redirect or not (*-page.sql.gz), the category table has category
> names, ids, and summary information (*-category.sql.gz), and
> categorylinks has the list of all category links in a page, with the
> page id and the category name (*-categorylinks.sql.gz).  You can find
> details on the tables here:
> http://www.mediawiki.org/wiki/Manual:Categorylinks_table
> (here's the category:
> http://www.mediawiki.org/wiki/Category:MediaWiki_database_tables )

>

> Hopefully this should get you started.

>

> Ariel

>

> On 09-01-2013 (Wed), at 10:51 -0800, Robert Crowe wrote:

> > I'd like to mirror just the category structure of the English
> > Wikipedia, and I'm wondering which of the dump files I need to start
> > with.
> >
> > I don't need the page content, just the page names, and only for the
> > most current revision.  I need the categories and category members,
> > and I'd like to exclude hidden categories.  I also need to
> > distinguish redirects, because I don't want to treat them as
> > separate pages.  As much as possible I'd like to work with SQL
> > files, but I can crunch through XML if necessary.
> >
> > So which files do I need to download?  I may also need some help in
> > understanding the schemas.

> >

> > 

> >

> > Thanks,

> >

> > 

> >

> > Robert

> >

> > 

> >

> >

> > _______________________________________________

> > Xmldatadumps-l mailing list

> > Xmldatadumps-l@lists.wikimedia.org

> > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

>

>