Hi, I wasn't sure whether this was the appropriate mailing list for this question - if not, pointers to the correct one would be appreciated.
I would like to retrieve pages that contain, say, a Drugbox. The following URL lists all pages that contain this infobox:
http://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Template:Dru...
What I'd like to do next is a bulk export of these pages. As far as I can tell, the Export options require that one provide article titles. Furthermore, for some other infoboxes I have to page through the results by hand. I'd like to do this programmatically instead.
The obvious solution would be to load Wikipedia into a local MySQL DB and then perform the queries directly. But I'm interested in a rather small subset of Wikipedia and loading the whole thing locally seems overkill.
Is there a way I could export the articles containing Drugboxes or do I need to install Wikipedia locally?
Thanks,
On 05/01/11 10:03, Rajarshi Guha wrote:
Is there a way I could export the articles containing Drugboxes or do I need to install Wikipedia locally?
The best way to do it would be to get the list of articles using the API:
http://www.mediawiki.org/wiki/API
If that's too hard, you could download templatelinks.sql.gz from
http://download.wikimedia.org/enwiki/latest/
and load it into a MySQL database, and then use that to get the list of articles. But it's a big file and it's out of date. Either way, you should get a list of articles and then download them in small batches (say 10 articles at a time) using Special:Export. This may require a small amount of scripting.
-- Tim Starling
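A minimal sketch of the batching step described above, in Python. It assumes Special:Export accepts a newline-separated `pages` field together with `curonly` and `action=submit`, sent to `/w/index.php`; those parameter names and the endpoint are not given in this thread and should be checked against the Special:Export documentation. The helper names `batches` and `export_request_body` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; the thread only says to use Special:Export.
EXPORT_URL = "https://en.wikipedia.org/w/index.php"


def batches(titles, size=10):
    """Yield consecutive slices of `titles`, each holding at most `size` items."""
    for start in range(0, len(titles), size):
        yield titles[start:start + size]


def export_request_body(batch):
    """Build a form-encoded body asking Special:Export for one batch of titles.

    The field names used here (pages, curonly, action) are assumptions.
    """
    return urlencode({
        "title": "Special:Export",
        "pages": "\n".join(batch),  # one title per line
        "curonly": "1",             # current revision only
        "action": "submit",
    })
```

Each body could then be POSTed to `EXPORT_URL` (for example with `urllib.request.urlopen(EXPORT_URL, data=body.encode())`), ideally with a pause between batches to keep the load on the servers small.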
On Wed, Jan 5, 2011 at 7:06 AM, Tim Starling tstarling@wikimedia.org wrote:
On 05/01/11 10:03, Rajarshi Guha wrote:
Is there a way I could export the articles containing Drugboxes or do I need to install Wikipedia locally?
The best way to do it would be to get the list of articles using the API:
Thanks for the pointer. I've been trying this, but can't seem to get the same information as with the manual approach.
For example, viewing
http://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Template:Dru...
gives me a page listing pages that link to the Template:Drugbox page.
However, if I construct an API call:
http://en.wikipedia.org/w/api.php?action=query&list=backlinks&bltitl...
I get no results.
If I drop the blnamespace parameter I get a set of links, but as far as I can tell, none of them are in the Main namespace.
I must be missing something obvious, and any pointers to getting it to work would be much appreciated.
Thanks,
On 10/01/11 03:40, Rajarshi Guha wrote:
However, if I construct an API call:
http://en.wikipedia.org/w/api.php?action=query&list=backlinks&bltitl...
I get no results.
Use list=embeddedin not list=backlinks.
To continue the query, use eicontinue:
-- Tim Starling
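A rough Python sketch of the suggestion above: query `list=embeddedin` and keep feeding `eicontinue` back until the API stops returning a continuation. It assumes the JSON response carries the continuation values under a top-level `continue` key, as current MediaWiki releases do; older releases used a different layout, so treat that as an assumption. The names `fetch_json` and `embedding_titles` and the User-Agent string are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Perform one GET request against the API and decode the JSON reply."""
    request = Request(
        API_URL + "?" + urlencode(params),
        headers={"User-Agent": "embeddedin-sketch/0.1 (example)"},  # placeholder
    )
    with urlopen(request) as response:
        return json.load(response)


def embedding_titles(template, fetch=fetch_json, namespace=0):
    """Collect titles of pages that embed `template`, following eicontinue."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template,
        "einamespace": namespace,  # 0 is the Main (article) namespace
        "eilimit": 500,
        "format": "json",
    }
    titles = []
    while True:
        data = fetch(dict(params))  # pass a copy so callers can inspect each request
        pages = data.get("query", {}).get("embeddedin", [])
        titles.extend(page["title"] for page in pages)
        continuation = data.get("continue")  # assumed modern continuation format
        if not continuation:
            return titles
        params.update(continuation)  # adds eicontinue for the next request
```

Calling `embedding_titles("Template:Drugbox")` would then return the article titles, which can be passed to Special:Export in small batches. The `fetch` argument exists only so the paging logic can be exercised without network access.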