Hello Pywikipedians,
As part of the TREC Knowledge Base Acceleration[1] eval for 2012, I want to generate snapshots of Wikipedia's link graph around certain pages as of a given date. Does anyone have advice about this? Is anyone doing something similar that I could learn from?
For example, given a wikipedia.Page() instance for urlname=Takashi_Murakami, call it the "central node," I want to iterate through the Pages returned by getReferences() and keep only those that linked to my central node on a date in the past, like November 30, 2011.
Is the best (only?) way to do this to iterate through the previous revisions of those Pages and verify that they linked to the central node before that date? And what about pages that used to link but no longer do? (Probably rare?)
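Roughly, I'm imagining something like the following (an untested sketch against the compat pywikipedia framework; the getVersionHistory() tuple layout, timestamp format, and getAll flag may differ across versions):

    import wikipedia  # the compat pywikipedia framework

    CUTOFF = '2011-11-30T00:00:00Z'  # assumes ISO timestamps, comparable as strings

    site = wikipedia.getSite('en', 'wikipedia')
    central = wikipedia.Page(site, 'Takashi Murakami')

    def linked_before(page, target, cutoff):
        # getVersionHistory() yields (oldid, timestamp, user, comment),
        # newest first; tuple layout may differ per framework version
        for oldid, timestamp, user, comment in page.getVersionHistory(getAll=True):
            if timestamp <= cutoff:
                text = page.getOldVersion(oldid)
                # crude link test: misses redirects, case variants, underscores
                return ('[[' + target) in text
        return False  # page has no revision older than the cutoff

    snapshot = [p for p in central.getReferences()
                if linked_before(p, central.title(), CUTOFF)]

But that is one revision fetch per referring page, which seems expensive, hence the question.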
Thanks for any pointers!
John
[1] http://www.mit.edu/~jrf/knowledge-base-acceleration/
--
___________________________
John R. Frank
jrf@mit.edu
2012/1/21 John R. Frank <jrf@mit.edu>:
> verify that they linked to the central node before that date? And what about pages that used to link but no longer do? (Probably rare?)
I think this information is not kept explicitly, so if you really need it, the only option is brute force. An alternative idea is to take the dump closest to the given date and use that (but I am not sure that all the old dumps are still publicly available).
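For the dump route, the dump directories on dumps.wikimedia.org are named by date (YYYYMMDD), so picking the one closest to your target date is easy to script. A rough sketch (untested, Python 3 idioms; it just scrapes the index page):

    import re
    import urllib.request
    from datetime import date

    TARGET = date(2011, 11, 30)
    INDEX = 'http://dumps.wikimedia.org/enwiki/'

    html = urllib.request.urlopen(INDEX).read().decode('utf-8', 'replace')

    def to_date(s):
        # dump directories are named YYYYMMDD; skip stray 8-digit numbers
        try:
            return date(int(s[:4]), int(s[4:6]), int(s[6:]))
        except ValueError:
            return None

    candidates = [d for d in (to_date(s) for s in re.findall(r'\b\d{8}\b', html)) if d]
    closest = min(candidates, key=lambda d: abs((d - TARGET).days))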
Hello John,
On 21 January 2012 02:56, John R. Frank <jrf@mit.edu> wrote:
> For example, given a wikipedia.Page() instance for urlname=Takashi_Murakami, call it the "central node," I want to iterate through the Pages returned by getReferences() and keep only those that linked to my central node on a date in the past, like November 30, 2011.
> Is the best (only?) way to do this to iterate through the previous revisions of those Pages and verify that they linked to the central node before that date? And what about pages that used to link but no longer do? (Probably rare?)
Pywikipedia can only help with the 'a' side of the 'a->b' link, that is, the page where the link is stored. It should be able to retrieve old page sources and tell you which pages were linked *from* that page.
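For example, something like this (untested; the compat method names may differ per version, and the regex is only a rough approximation of MediaWiki's link syntax):

    import re
    import wikipedia

    site = wikipedia.getSite('en', 'wikipedia')
    page = wikipedia.Page(site, 'Takashi Murakami')

    # take the newest revision as an example; filter by timestamp
    # to get the revision that was current on your target date
    oldid, timestamp, user, comment = page.getVersionHistory()[0]
    old_text = page.getOldVersion(oldid)

    # pull the target of every [[...]] wikilink out of the old source
    links = set(m.group(1).strip()
                for m in re.finditer(r'\[\[([^\]|#]+)', old_text))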
The inverse, however, is difficult. To answer such queries efficiently, MediaWiki keeps the backlinks in a separate table, 'pagelinks', but that table of course only holds the most recent state. It is included in the dumps, but for enwiki it is huge (other wikis are much smaller):
enwiki-latest-pagelinks.sql.gz 4.0G [1]
However, this table is dumped roughly monthly, with the oldest available version from November 2009 [2], so it might be possible to use one of those.
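If you do use it, you can stream the gzipped SQL and keep only the rows that point at your central node, rather than loading the whole table. A rough sketch (untested, Python 3; it assumes the (pl_from, pl_namespace, pl_title) schema of the dumps from that era, and note that pl_from is a page ID, so you also need the 'page' table dump to map IDs back to titles):

    import gzip
    import re

    TARGET = "'Takashi_Murakami'"  # titles are stored with underscores
    # each INSERT line packs many (pl_from,pl_namespace,pl_title) tuples
    row = re.compile(r"\((\d+),(\d+),('(?:[^'\\]|\\.)*')\)")

    referring_ids = set()
    with gzip.open('enwiki-latest-pagelinks.sql.gz', 'rt',
                   encoding='utf-8', errors='replace') as f:
        for line in f:
            if not line.startswith('INSERT INTO'):
                continue
            for pl_from, pl_namespace, pl_title in row.findall(line):
                if pl_namespace == '0' and pl_title == TARGET:  # ns 0 = articles
                    referring_ids.add(int(pl_from))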
Best regards,
Merlijn van Deen
[1] http://dumps.wikimedia.org/enwiki/latest/
[2] http://dumps.wikimedia.org/enwiki/