I'm confused. I downloaded the 2012-12-01 dump files, but when I look up known
categories I'm not finding what I expect. For example:
SELECT *
FROM `categorylinks`
WHERE `cl_to` = 'Humanities'
Yields 9 rows, but:
http://en.wikipedia.org/wiki/Category:Humanities
lists 26 subcategories and 71 pages. I'm wondering if maybe I downloaded the wrong
files, or if they didn't import completely. Here are the files and row counts, as
reported by phpMyAdmin:
1. <http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-category.sql.gz>
enwiki-20121201-category.sql.gz - ~1,544,750 rows
2. <http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-categorylinks.sql.gz>
enwiki-20121201-categorylinks.sql.gz - ~1,380,956 rows
3. <http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-page.sql.gz>
enwiki-20121201-page.sql.gz - ~1,492,392 rows
4. <http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-page_props.sql.gz>
enwiki-20121201-page_props.sql.gz - ~5,415,922 rows
The MD5 checksums match. What am I doing wrong?
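
One way to check whether the import is complete is sketched below. It rests on assumptions that are not stated in this thread: the `~` that phpMyAdmin shows typically marks an estimated row count (as with InnoDB tables), so an exact `COUNT(*)` is more trustworthy; `cl_to` stores the category title with underscores instead of spaces; and `categorylinks` has a `cl_type` column ('page', 'subcat', 'file') as described in the MediaWiki manual. The sketch uses Python's sqlite3 module and a tiny made-up table in place of the real MySQL import, and `count_members` is a hypothetical helper name.

```python
import sqlite3

def count_members(conn, category):
    """Return exact member counts for one category, keyed by cl_type."""
    # Assumption: cl_to holds the title without the "Category:" prefix,
    # with spaces replaced by underscores.
    title = category.replace(" ", "_")
    rows = conn.execute(
        "SELECT cl_type, COUNT(*) FROM categorylinks "
        "WHERE cl_to = ? GROUP BY cl_type",
        (title,),
    ).fetchall()
    return dict(rows)

# Toy stand-in for the imported table (made-up rows, not real dump data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_type TEXT)"
)
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?, ?)",
    [
        (1, "Humanities", "subcat"),
        (2, "Humanities", "subcat"),
        (3, "Humanities", "page"),
        (4, "Social_sciences", "page"),
    ],
)

counts = count_members(conn, "Humanities")
# Exact total, as opposed to an estimate shown by an admin tool.
total = conn.execute("SELECT COUNT(*) FROM categorylinks").fetchone()[0]
print(counts, total)
```

Against the real database the same two SELECT statements can be run directly in MySQL. If the exact total is far below what the dump file should contain, an incomplete import is the likelier explanation than a wrong file.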
Thanks,
Robert
-----Original Message-----
From: Ariel T. Glenn [mailto:ariel@wikimedia.org]
Sent: Thursday, January 10, 2013 10:50 AM
To: Robert Crowe
Cc: xmldatadumps-l(a)lists.wikimedia.org
Subject: RE: [Xmldatadumps-l] Which files do I need?
You want the page_props table (*-page_props.sql.gz); look for entries whose
pp_propname is the string 'hiddencat'.
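
A minimal sketch of that lookup follows, in Python with sqlite3 standing in for MySQL. The join columns are assumptions taken from the MediaWiki schema documentation rather than from this message: `pp_page` is taken to reference `page.page_id`, and category pages are taken to live in namespace 14. The helper names and the sample rows are made up for illustration.

```python
import sqlite3

# Titles of category pages that carry the 'hiddencat' page property.
HIDDEN_SQL = """
SELECT p.page_title
FROM page_props AS pp
JOIN page AS p ON p.page_id = pp.pp_page
WHERE pp.pp_propname = 'hiddencat'
  AND p.page_namespace = 14
"""

def hidden_categories(conn):
    """Return the set of hidden category titles."""
    return {row[0] for row in conn.execute(HIDDEN_SQL)}

def visible_categories(conn, page_id):
    """Return the non-hidden categories a page belongs to, sorted by title."""
    hidden = hidden_categories(conn)
    rows = conn.execute(
        "SELECT cl_to FROM categorylinks WHERE cl_from = ? ORDER BY cl_to",
        (page_id,),
    )
    return [title for (title,) in rows if title not in hidden]

# Toy data (made up): page 1 sits in one visible and one hidden category.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
INSERT INTO page VALUES (1, 0, 'Philosophy');
INSERT INTO page VALUES (20, 14, 'Humanities');
INSERT INTO page VALUES (21, 14, 'Articles_needing_cleanup');
INSERT INTO page_props VALUES (21, 'hiddencat', '');
INSERT INTO page_props VALUES (20, 'defaultsort', 'Humanities');
INSERT INTO categorylinks VALUES (1, 'Humanities');
INSERT INTO categorylinks VALUES (1, 'Articles_needing_cleanup');
""")

hidden = hidden_categories(conn)
visible = visible_categories(conn, 1)
print(hidden, visible)
```

For a full mirror it may be cheaper to materialize the hidden titles once into their own table and anti-join against it, rather than filtering in application code as this sketch does.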
Ariel
On 10-01-2013, Thu, at 09:58 -0800, Robert Crowe wrote:
Perfect! Thanks Ariel. What is the best way to distinguish hidden categories?
I see that the category table used to have a cat_hidden column, but that's
been removed.
Robert
-----Original Message-----
From: Ariel T. Glenn [mailto:ariel@wikimedia.org]
Sent: Thursday, January 10, 2013 3:34 AM
To: Robert Crowe
Cc: xmldatadumps-l(a)lists.wikimedia.org
Subject: Re: [Xmldatadumps-l] Which files do I need?
If you are just trying to get at the structure from the various dump
files, the page table has page ids, titles, and whether the page is a
redirect or not (*-page.sql.gz), the category table has category
names, ids, and summary information (*-category.sql.gz), and
categorylinks has the list of all category links in a page, with the
page id and the category name (*-categorylinks.sql.gz). You can find
details on the tables here:
http://www.mediawiki.org/wiki/Manual:Categorylinks_table
(here's the category:
http://www.mediawiki.org/wiki/Category:MediaWiki_database_tables )
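
As a rough illustration of how these tables fit together, the sketch below joins categorylinks to page to list a category's members while skipping redirects. It is written in Python with sqlite3 as a stand-in for the MySQL import. The column names (`page_is_redirect`, `cl_from`, `cl_to`, `cl_type`) are assumed from the manual pages linked above, `category_members` is a hypothetical helper, and the rows are made up.

```python
import sqlite3

# Members of one category, with redirect pages filtered out.
MEMBERS_SQL = """
SELECT p.page_namespace, p.page_title, cl.cl_type
FROM categorylinks AS cl
JOIN page AS p ON p.page_id = cl.cl_from
WHERE cl.cl_to = ?
  AND p.page_is_redirect = 0
ORDER BY cl.cl_type, p.page_title
"""

def category_members(conn, category_title):
    """Return (namespace, title, cl_type) rows for non-redirect members."""
    return conn.execute(MEMBERS_SQL, (category_title,)).fetchall()

# Toy data (made up): one article, one redirect, one subcategory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (
    page_id INTEGER PRIMARY KEY,
    page_namespace INTEGER,
    page_title TEXT,
    page_is_redirect INTEGER
);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_type TEXT);
INSERT INTO page VALUES (1, 0, 'Philosophy', 0);
INSERT INTO page VALUES (2, 0, 'Philosphy', 1);
INSERT INTO page VALUES (3, 14, 'Arts', 0);
INSERT INTO categorylinks VALUES (1, 'Humanities', 'page');
INSERT INTO categorylinks VALUES (2, 'Humanities', 'page');
INSERT INTO categorylinks VALUES (3, 'Humanities', 'subcat');
""")

members = category_members(conn, "Humanities")
for namespace, title, kind in members:
    print(namespace, title, kind)
```

The category table's summary counts could serve as a cross-check on the number of rows such a join returns, though whether they agree exactly with a given dump is not something this thread confirms.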
Hopefully this should get you started.
Ariel
On 09-01-2013, Wed, at 10:51 -0800, Robert Crowe wrote:
> I'd like to mirror just the category structure of the English
> Wikipedia, and I'm wondering which of the dump files I need to start
> with.
>
> I don't need the page content, just the page names, and only for the
> most current revision. I need the categories and category members,
> and I'd like to exclude hidden categories. I also need to
> distinguish redirects, because I don't want to treat them as
> separate pages. As much as possible I'd like to work with SQL
> files, but I can crunch through XML if necessary.
>
> So which files do I need to download? I may also need some help in
> understanding the schemas.
>
> Thanks,
>
> Robert
>
> _______________________________________________
> Xmldatadumps-l mailing list
> Xmldatadumps-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l