While we (Research & Data @ WMF) consider maintaining a standard dump of
categories and pagelinks, anyone can pull such a dataset from the Tool Labs
replica databases (see http://tools.wmflabs.org) with the following two
queries.
/* Get all page links */
SELECT
  origin.page_id AS from_id,
  origin.page_namespace AS from_namespace,
  origin.page_title AS from_title,
  dest.page_id AS to_id, /* NULL if page doesn't exist */
  pl_namespace AS to_namespace,
  pl_title AS to_title
FROM pagelinks
LEFT JOIN page origin ON origin.page_id = pl_from
LEFT JOIN page dest ON dest.page_namespace = pl_namespace AND
                       dest.page_title = pl_title;
/* Get all category links */
SELECT
  origin.page_id AS from_id,
  origin.page_namespace AS from_namespace,
  origin.page_title AS from_title,
  cl_to AS category_title
FROM categorylinks
LEFT JOIN page origin ON origin.page_id = cl_from;
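If you'd rather script this than paste SQL into a console, a minimal
sketch along these lines should work from a tools account (this assumes
Python with PyMySQL; the enwiki.labsdb host, enwiki_p database, and
~/replica.my.cnf credentials file are the usual Tool Labs conventions,
but check the docs for your setup):

import os
import pymysql

conn = pymysql.connect(
    host="enwiki.labsdb",  # replica host for English Wikipedia
    db="enwiki_p",         # "_p" = the public (redacted) view
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)

def clean(field):
    # MediaWiki stores titles as binary, so decode bytes for output.
    if field is None:
        return ""
    if isinstance(field, bytes):
        return field.decode("utf-8", "replace")
    return str(field)

with conn.cursor() as cursor:
    cursor.execute("""
        SELECT origin.page_id AS from_id,
               origin.page_namespace AS from_namespace,
               origin.page_title AS from_title,
               cl_to AS category_title
        FROM categorylinks
        LEFT JOIN page origin ON origin.page_id = cl_from
        LIMIT 10
    """)
    for row in cursor.fetchall():
        print("\t".join(clean(f) for f in row))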
Note that these tables are very large. For English Wikipedia,
pagelinks contains ~900 million rows and
categorylinks contains ~66 million rows.
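Because of those row counts, don't buffer a full result set in client
memory. Continuing the sketch above, an unbuffered (server-side) cursor
streams rows as they arrive; pagelinks_query here is just a placeholder
for the "Get all page links" query above:

import pymysql.cursors

with conn.cursor(pymysql.cursors.SSCursor) as cursor:
    cursor.execute(pagelinks_query)  # the "Get all page links" SQL above
    with open("pagelinks.tsv", "w") as out:
        for row in cursor:  # streams; nothing buffered client-side
            out.write("\t".join(clean(f) for f in row) + "\n")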
-Aaron
On Mon, Dec 16, 2013 at 11:28 AM, Giovanni Luca Ciampaglia <glciampagl(a)gmail.com> wrote:
+1
Same here. Also, using a standardized dataset would make it much easier to
reproduce others' work.
G
On Sun 15 Dec 2013 05:19:54 AM EST, Carlos Castillo wrote:
Hi,
I think this is definitely a great idea which will save lots of
researchers a ton of work.
Cheers,
--
Giovanni Luca Ciampaglia
Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University
✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gciampag(a)indiana.edu
✆ 1-812-855-7261