On Wed, Oct 10, 2012 at 2:15 PM, Brian Keegan <bkeegan(a)gmail.com> wrote:
Hi all,
I'm trying to scrape some data from en.wiki about the outlinks from the body
of articles. However, the API returns article outlinks contained within
templates. While I can write a routine to get a list of all the templates
and identify the article links inside these templates to remove from the
outlinks, this is problematic if a link appears in both the body and a
template. Thus if article X has a link to Y in the body as well as links to
Y and Z in templates, I want to capture Y but not Y & Z.
Ideally, I'd like to either (1) count the number of times an
article links out to another article (if X links to Y twice) and then
decrement this count for each appearance in a template or (2) count only
the links occurring in the body, without parsing the links in templates.
Thank you in advance for your suggestions!
Neither of these things is supported by the API, because the
underlying functionality in MediaWiki (the links tables and the
ParserOutput metadata) doesn't provide or store this information. You
would have to do some kind of processing of your own to get this
information.
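
As a rough illustration of what such processing might look like, here is a
minimal Python sketch of option (2). It assumes you have already fetched the
raw wikitext of the article (for example via prop=revisions with
rvprop=content) and works on that string only. The function names
strip_templates and count_body_links are made up for this example and are not
part of MediaWiki or its API. It drops everything inside {{...}} template
calls, then counts the [[...]] links that remain.

```python
import re
from collections import Counter

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# group 1 is the target page title without section or label.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def strip_templates(wikitext):
    """Return wikitext with {{...}} template calls removed (nesting-aware)."""
    out = []
    depth = 0  # how many template calls we are currently inside
    i = 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(wikitext[i])
            i += 1
    return "".join(out)


def count_body_links(wikitext):
    """Count links written directly in the article body, per target title."""
    counts = Counter()
    for match in WIKILINK.finditer(strip_templates(wikitext)):
        # Approximate title normalisation: underscores are spaces and the
        # first letter is case-insensitive on en.wiki.
        target = match.group(1).strip().replace("_", " ")
        if target:
            counts[target[0].upper() + target[1:]] += 1
    return counts


if __name__ == "__main__":
    # X links to Y twice in the body, and to Y and Z inside a template.
    sample = "Body links to [[Y]] and [[Y|again]]. {{Navbox|list=[[Y]] [[Z]]}}"
    print(count_body_links(sample))
```

This is only an approximation of what the parser does. It also counts File:
and Category: links unless you filter them out, it does not skip comments or
<nowiki> sections, and it treats a link passed as a template argument (an
infobox field, say) as a template link.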
Roan