While we (Research & Data @ WMF) consider maintaining a standard dump of categories and pagelinks, anyone can pull such a dataset from the Tool Labs database replicas (see http://tools.wmflabs.org) with the following two queries.

/* Get all page links */
SELECT
  origin.page_id AS from_id,
  origin.page_namespace AS from_namespace, 
  origin.page_title AS from_title, 
  dest.page_id AS to_id, /* NULL if page doesn't exist */
  pl_namespace AS to_namespace, 
  pl_title AS to_title
FROM pagelinks 
LEFT JOIN page origin ON origin.page_id = pl_from
LEFT JOIN page dest ON dest.page_namespace = pl_namespace AND dest.page_title = pl_title;

/* Get all category links */
SELECT
  origin.page_id AS from_id,
  origin.page_namespace AS from_namespace,
  origin.page_title AS from_title,
  cl_to AS category_title
FROM categorylinks
LEFT JOIN page origin ON page_id = cl_from;

Note that these tables are very large.  For English Wikipedia, pagelinks contains ~900 million rows and categorylinks contains ~66 million rows.
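
Given those sizes, a single query may run long or get killed on the replicas. One workaround would be to split the extraction into chunks on pl_from (or cl_from for the category query) and concatenate the results afterwards; the sketch below is just the first query restricted to one illustrative range, and the chunk boundaries are an assumption you would tune to whatever the replicas tolerate.

/* Get page links for one chunk of source pages (range is illustrative) */
SELECT
  origin.page_id AS from_id,
  origin.page_namespace AS from_namespace,
  origin.page_title AS from_title,
  dest.page_id AS to_id, /* NULL if page doesn't exist */
  pl_namespace AS to_namespace,
  pl_title AS to_title
FROM pagelinks
LEFT JOIN page origin ON origin.page_id = pl_from
LEFT JOIN page dest ON dest.page_namespace = pl_namespace AND dest.page_title = pl_title
WHERE pl_from BETWEEN 0 AND 999999;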

-Aaron


On Mon, Dec 16, 2013 at 11:28 AM, Giovanni Luca Ciampaglia <glciampagl@gmail.com> wrote:
+1

Same here. Also, using a standardized dataset would make it much easier to reproduce others' work.

G


On Sun 15 Dec 2013 05:19:54 AM EST, Carlos Castillo wrote:
Hi,

I think this is definitely a great idea that will save lots of
researchers a ton of work.

Cheers,




--
Giovanni Luca Ciampaglia

Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University

✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
http://cnets.indiana.edu/
gciampag@indiana.edu
1-812-855-7261


