On 21 January 2012 02:56, John R. Frank <jrf@mit.edu> wrote:
> For example, given a wikipedia.Page() instance for urlname=Takashi_Murakami (call it the "central node"), I want to iterate through the Pages returned by getReferences() and keep only those that referenced my central node on some date in the past, such as November 30, 2011.
> Is the best (only?) way to do this to iterate through previous revisions of those Pages and verify that they referenced the central node before that date? And what about pages that did reference it but no longer do? (Probably rare?)
Pywikipedia can only help with the 'a' side of the 'a->b' link, that is, the page on which the link is stored: it can retrieve old page sources and tell you which pages were linked *from* that page at the time.
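For the approach you describe, something like this might work (a minimal sketch against the pywikipedia compat API as I remember it; the getVersionHistory()/getOldVersion() names and the timestamp format should be verified against your checkout):

    # -*- coding: utf-8 -*-
    # Sketch: for each page that currently references the central node,
    # check whether it already linked to it on a given past date.
    # Assumes getVersionHistory() returns (oldid, timestamp, user,
    # comment) tuples, newest first, with ISO timestamps -- verify this.
    import datetime
    import wikipedia

    CUTOFF = datetime.datetime(2011, 11, 30)

    site = wikipedia.getSite('en', 'wikipedia')
    central = wikipedia.Page(site, 'Takashi Murakami')

    old_referrers = []
    for page in central.getReferences():
        # Take the newest revision at or before the cutoff date.
        for oldid, timestamp, user, comment in page.getVersionHistory():
            ts = datetime.datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%SZ')
            if ts <= CUTOFF:
                text = page.getOldVersion(oldid)
                # Crude check: look for the wikilink in the old source.
                # This misses redirects and template-generated links.
                if '[[Takashi Murakami' in text:
                    old_referrers.append(page.title())
                break

    print '\n'.join(old_referrers)

Note that this only ever visits pages that link to the central node *now*, which is exactly the gap you point out: pages that linked then but no longer do are invisible to getReferences().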
The inverse, however, is difficult. To answer 'what links here' efficiently, MediaWiki keeps this information in a separate table, the 'pagelinks' table, but that table of course only contains the *current* link structure. It is included in the database dumps, but it is huge for enwiki (other wikis are much smaller):
enwiki-latest-pagelinks.sql.gz 4.0G [1]
However, this table is dumped roughly monthly, with the oldest available version from November 2009 [2], so for dates after that you might be able to use the dump closest to your target date.
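If you go that route, something along these lines could pull the source page ids out of an old dump (a rough sketch; the file name is an example, and a real parser should handle quoting in titles, which this crude regex does not):

    # Sketch: scan an old pagelinks dump for links *to* the central node.
    # Rows in the INSERT statements look like (pl_from,pl_namespace,'pl_title');
    # we match namespace 0 and the exact target title.
    import gzip
    import re

    TARGET = 'Takashi_Murakami'
    ROW = re.compile(r"\((\d+),0,'%s'\)" % re.escape(TARGET))

    dump = gzip.open('enwiki-20111201-pagelinks.sql.gz', 'rb')
    linking_page_ids = set()
    for line in dump:  # one very long INSERT statement per line
        for match in ROW.finditer(line):
            linking_page_ids.add(int(match.group(1)))
    dump.close()

    print len(linking_page_ids), 'pages linked to', TARGET

Keep in mind that pl_from holds page ids, not titles, so you would still need the page table dump (or the API) to map the ids back to page titles.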
Best regards,
Merlijn van Deen