Hello John,
On 21 January 2012 02:56, John R. Frank <jrf(a)mit.edu> wrote:
For example, given a wikipedia.Page() instance for
urlname=Takashi_Murakami, call it "central node," I want to iterate through
the Pages returned by getReferences() and only keep those that ref'ed my
central node on a date in the past, like November 30, 2011.
Is the best (only?) way to do this to iterate through previous revisions
of those Pages and verify that they ref'ed the central node before that
date? What about pages that did ref it but no longer do? (probably rare?)
Pywikipedia can only help with the 'a' side of an 'a->b' link - that is,
the page where the link is stored. It should be able to retrieve old page
sources and tell you which pages were linked *from* that page.
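As a rough illustration of that 'a' side: once you have an old revision's
wikitext (pywikibot's Page exposes methods such as getOldVersion() for
this), you can check whether it still linked to the central node at that
time. The helper below is a hypothetical sketch, not part of pywikipedia;
it only approximates MediaWiki's title normalization (piped links,
underscore/space equivalence, case-insensitive first letter).

```python
import re

def links_to(wikitext, target):
    """Return True if `wikitext` contains a wikilink to `target`.

    Handles piped links ([[Target|label]]) and treats underscores and
    spaces in titles as equivalent, roughly like MediaWiki does.
    """
    normalized = target.replace('_', ' ').strip()
    # Capture the title part of each [[...]] link, stopping at '|' or '#'.
    for match in re.finditer(r'\[\[([^\]|#]+)', wikitext):
        title = match.group(1).replace('_', ' ').strip()
        # The first letter of a title is case-insensitive in MediaWiki.
        if title[:1].upper() + title[1:] == normalized[:1].upper() + normalized[1:]:
            return True
    return False
```

You would run this over the revision of each referencing page that was
current on your cutoff date (e.g. November 30, 2011).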
The inverse, however, is difficult. To answer it efficiently, MediaWiki
keeps the link data in an extra table, but that table of course only
contains the most recent state. This is the 'pagelinks' table, which is
dumped, but the dump is huge (for enwiki; other wikis are much smaller):
enwiki-latest-pagelinks.sql.gz 4.0G [1]
However, this table is dumped roughly monthly, with the oldest available
version from November 2009 [2], so it might be possible to use those
historical dumps to reconstruct the 'b' side for a past date.
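If you go the dump route, a sketch like the following could scan an
uncompressed pagelinks SQL dump for rows pointing at a given title. This
assumes the pagelinks schema of that era - rows of (pl_from,
pl_namespace, pl_title), with underscores in titles - and a simple
regex over the INSERT statements; a real run would stream the 4 GB file
(e.g. through gzip) rather than hold it in memory.

```python
import re

# Each row in the dump looks like (pl_from, pl_namespace, 'pl_title');
# the regex tolerates backslash-escaped quotes inside titles.
ROW_RE = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'\)")

def pages_linking_to(sql_lines, target_title, namespace=0):
    """Yield pl_from page IDs of rows that target `target_title`."""
    for line in sql_lines:
        if not line.startswith('INSERT INTO'):
            continue
        for pl_from, ns, title in ROW_RE.findall(line):
            if int(ns) == namespace and title == target_title:
                yield int(pl_from)
```

The page IDs it yields would still need to be resolved to titles via the
'page' table dump, but it gives you the historical 'what links here' set
as of the dump's date.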
Best regards,
Merlijn van Deen
[1]
http://dumps.wikimedia.org/enwiki/latest/
[2]
http://dumps.wikimedia.org/enwiki/