I'd like to mirror just the category structure of the English Wikipedia, and I'm wondering which of the dump files I need to start with.
I don't need the page content, just the page names, and only for the most current revision. I need the categories and category members, and I'd like to exclude hidden categories. I also need to distinguish redirects, because I don't want to treat them as separate pages. As much as possible I'd like to work with SQL files, but I can crunch through XML if necessary.
So which files do I need to download? I may also need some help in understanding the schemas.
Thanks,
Robert
If you are just trying to get at the structure from the various dump files, the page table has page ids, titles, and whether the page is a redirect or not (*-page.sql.gz), the category table has category names, ids, and summary information (*-category.sql.gz), and categorylinks has the list of all category links in a page, with the page id and the category name (*-categorylinks.sql.gz). You can find details on the tables here: http://www.mediawiki.org/wiki/Manual:Categorylinks_table (here's the category: http://www.mediawiki.org/wiki/Category:MediaWiki_database_tables )
Hopefully this should get you started.
Ariel
In reply to Robert Crowe's message of Wed, 09-01-2013, at 10:51 -0800.
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
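The way the tables described above fit together can be sketched with a toy example. This is only an illustration under stated assumptions: it uses Python's sqlite3 module with an in-memory database instead of the real MySQL import, the tables are cut down to a few columns, the rows are made up, and the helper names build_toy_db and category_members are hypothetical. The category table is left out because it only carries summary information. The join relies on categorylinks.cl_from holding the page_id of the member page and cl_to holding the category name.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with cut-down page and categorylinks tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER,
                           page_title TEXT, page_is_redirect INTEGER);
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
        -- Made-up rows for illustration only.
        INSERT INTO page VALUES (1, 0, 'History', 0), (2, 0, 'Hist', 1),
                                (3, 14, 'Arts', 0), (4, 14, 'Humanities', 0);
        INSERT INTO categorylinks VALUES (1, 'Humanities'), (2, 'Humanities'),
                                         (3, 'Humanities');
    """)
    return db


def category_members(db, category_title):
    """Return (namespace, title) for each non-redirect member of a category."""
    rows = db.execute("""
        SELECT p.page_namespace, p.page_title
        FROM categorylinks AS cl
        JOIN page AS p ON p.page_id = cl.cl_from
        WHERE cl.cl_to = ? AND p.page_is_redirect = 0
        ORDER BY p.page_namespace, p.page_title
    """, (category_title,))
    return rows.fetchall()


if __name__ == "__main__":
    print(category_members(build_toy_db(), "Humanities"))
```

Members in namespace 14 are subcategories; the redirect row is filtered out by page_is_redirect.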
Perfect! Thanks Ariel. What is the best way to distinguish hidden categories? I see that the category table used to have a cat_hidden column, but that's been removed.
Robert
In reply to Ariel T. Glenn's message of Thursday, January 10, 2013, 3:34 AM (Subject: Re: [Xmldatadumps-l] Which files do I need?).
You want the page_props table (*-page_props.sql.gz); look for entries whose pp_propname is the string 'hiddencat'.
Ariel
In reply to Robert Crowe's message of Thu, 10-01-2013, at 09:58 -0800.
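A minimal sketch of the 'hiddencat' lookup, again using Python's sqlite3 with an in-memory toy database rather than the real import. The rows and the helper names build_toy_db and visible_categories are made up for illustration. It assumes that pp_page holds the page_id of the page carrying the property and that category pages live in namespace 14.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with cut-down page and page_props tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER,
                           page_title TEXT, page_is_redirect INTEGER);
        CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT);
        -- Made-up rows for illustration only.
        INSERT INTO page VALUES (3, 14, 'Arts', 0), (4, 14, 'Humanities', 0),
                                (5, 14, 'Articles_needing_cleanup', 0);
        INSERT INTO page_props VALUES (5, 'hiddencat', ''),
                                      (4, 'defaultsort', 'x');
    """)
    return db


def visible_categories(db):
    """Return titles of category pages that have no 'hiddencat' property."""
    rows = db.execute("""
        SELECT p.page_title
        FROM page AS p
        WHERE p.page_namespace = 14
          AND NOT EXISTS (
              SELECT 1 FROM page_props AS pp
              WHERE pp.pp_page = p.page_id
                AND pp.pp_propname = 'hiddencat')
        ORDER BY p.page_title
    """)
    return [title for (title,) in rows]


if __name__ == "__main__":
    print(visible_categories(build_toy_db()))
```

Other properties on the same page (here 'defaultsort') do not affect the result, since only rows with pp_propname = 'hiddencat' are checked.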
Hi Ariel,
I'm trying to understand the foreign keys. Is the page_props.pp_page column the key for category.cat_id? Or is it the key for page.page_id where page_namespace = 14?
Thanks,
Robert
In reply to Ariel T. Glenn's message of Thursday, January 10, 2013, 10:50 AM.
In reply to Robert Crowe's message of Sat, 12-01-2013, at 17:00 -0800:
pp_page is the page_id of the page that has the particular property set, so it refers to page.page_id rather than category.cat_id.
You wrote that this query yields 9 rows:
SELECT *
FROM `categorylinks`
WHERE `cl_to` = 'Humanities'
In the original copy of enwiki-20121201-categorylinks.sql.gz I see 96 lines with 'Humanities' for cl_to. Did you try grepping them out and having a look?
Ariel
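The check suggested above, looking in the dump itself for rows with 'Humanities' as cl_to, could be sketched in Python roughly as follows. This is an illustration, not the exact grep that was run: the function names are hypothetical, the sample line is made up, and the regular expression assumes each tuple in an INSERT statement starts with the numeric cl_from followed by the quoted cl_to. A string value that happens to contain the same pattern would be miscounted, so treat the result as approximate.

```python
import gzip
import re

# Matches the start of one categorylinks tuple: (cl_from,'cl_to',
TUPLE_START = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)',")


def count_links_to(lines, category):
    """Count tuples whose second field (cl_to) equals `category`."""
    total = 0
    for line in lines:
        if not line.startswith("INSERT INTO"):
            continue  # skip comments and schema statements
        for match in TUPLE_START.finditer(line):
            if match.group(2) == category:
                total += 1
    return total


def count_links_in_dump(path, category):
    """Stream a gzipped SQL dump and count links to `category`."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_links_to(handle, category)


if __name__ == "__main__":
    # Made-up sample in the shape of a mysqldump INSERT line.
    sample = [
        "-- MySQL dump header\n",
        "INSERT INTO `categorylinks` VALUES "
        "(1,'Humanities','A','2012-01-01 00:00:00','','uca-default','page'),"
        "(2,'Arts','B','2012-01-01 00:00:00','','uca-default','page'),"
        "(3,'Humanities','C','2012-01-01 00:00:00','','uca-default','subcat');\n",
    ]
    print(count_links_to(sample, "Humanities"))
```

For the real file, count_links_in_dump would be called with the local path of enwiki-20121201-categorylinks.sql.gz and the result compared with the row count returned by the SELECT.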
So that tells me that the import didn't complete correctly. I'll work on that. Is there anywhere to get row counts for the tables to confirm the import?
Robert
In reply to Ariel T. Glenn's message of Monday, January 14, 2013, 12:38 AM.
In reply to Robert Crowe's message of Mon, 14-01-2013, at 09:11 -0800:
I'd just count the rows in the SQL file: use sed to break the lines at the row boundaries and pipe the result to wc -l, or some such. It would not be perfect, but it would surely be 'close enough'.
Ariel
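The same rough count can be sketched in Python instead of sed and wc -l. This is a sketch under assumptions: the function names are hypothetical, and it assumes the dump uses multi-row INSERT statements in which rows are separated by "),(". As noted above, the result is only approximate, because a string value containing "),(" would be counted as a row boundary.

```python
import gzip


def estimate_rows(lines):
    """Estimate the number of rows in a SQL dump from its INSERT lines."""
    rows = 0
    for line in lines:
        if line.startswith("INSERT INTO"):
            # One tuple per statement, plus one more per "),(" separator.
            rows += 1 + line.count("),(")
    return rows


def estimate_rows_in_dump(path):
    """Stream a gzipped SQL dump and estimate its row count."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return estimate_rows(handle)


if __name__ == "__main__":
    # Made-up sample in the shape of a mysqldump file.
    sample = [
        "-- MySQL dump header\n",
        "INSERT INTO `page_props` VALUES (1,'hiddencat',''),"
        "(2,'defaultsort','X'),(3,'hiddencat','');\n",
        "INSERT INTO `page_props` VALUES (4,'hiddencat','');\n",
    ]
    print(estimate_rows(sample))
```

The estimate can then be compared with an exact SELECT COUNT(*) on the imported table.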
I'm confused. I downloaded the 2012-12-01 dump files, but looking for known categories I'm not finding what I expect. For example:
SELECT *
FROM `categorylinks`
WHERE `cl_to` = 'Humanities'
Yields 9 rows, but:
http://en.wikipedia.org/wiki/Category:Humanities
lists 26 subcategories and 71 pages. I'm wondering if maybe I downloaded the wrong files, or if they didn't import completely. Here are the files and row counts, as reported by phpMyAdmin:
1. http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-category.sql.gz - ~1,544,750 rows
2. http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-categorylinks.sql.gz - ~1,380,956 rows
3. http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-page.sql.gz - ~1,492,392 rows
4. http://dumps.wikimedia.org/enwiki/20121201/enwiki-20121201-page_props.sql.gz - ~5,415,922 rows
The MD5 checksums match. What am I doing wrong?
Thanks,
Robert