Hello Pywikipedians,
As part of the TREC Knowledge Base Acceleration[1] evaluation for 2012, I want to generate snapshots of Wikipedia's link graph around certain pages as of a given date. Does anyone have advice on this, or is anyone doing something similar that I could learn from?
For example, given a wikipedia.Page() instance for urlname=Takashi_Murakami, call it the "central node," I want to iterate through the Pages returned by getReferences() and keep only those that linked to the central node as of a date in the past, e.g. November 30, 2011.
Is the best (only?) way to do this to iterate through the previous revisions of each of those Pages and verify that it linked to the central node before that date? And what about pages that did link to it but no longer do? (Probably rare?)
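To make the question concrete, here is a rough sketch of the per-page check I have in mind, in plain Python. The toy revision list stands in for what getVersionHistory()/getOldVersion() would return for a real page, and the link matching is simplified to fully case-insensitive (MediaWiki only folds the first letter of a title), so treat this as an assumption-laden sketch rather than working bot code:

```python
import re
from datetime import datetime

def linked_on_date(revisions, target, cutoff):
    """revisions: iterable of (timestamp, wikitext) pairs for one page.
    Returns True if the latest revision at or before `cutoff`
    contained a wikilink to `target`.
    """
    past = [r for r in revisions if r[0] <= cutoff]
    if not past:
        return False  # page had no revision yet on that date
    text = max(past, key=lambda r: r[0])[1]
    # Underscores and spaces are interchangeable in page titles.
    norm = target.replace('_', ' ').strip().lower()
    # Scan [[Target]] / [[Target|label]] / [[Target#section]] links.
    for m in re.finditer(r'\[\[([^\]|#]+)', text):
        if m.group(1).replace('_', ' ').strip().lower() == norm:
            return True
    return False

# Toy revision history; real data would come from the page's
# revision history instead:
revs = [
    (datetime(2011, 11, 1), "See [[Takashi Murakami]] and superflat."),
    (datetime(2012, 1, 15), "The link was removed in this later edit."),
]
print(linked_on_date(revs, 'Takashi_Murakami', datetime(2011, 11, 30)))  # True
print(linked_on_date(revs, 'Takashi_Murakami', datetime(2012, 2, 1)))    # False: link since removed
```

This also handles the "did link but no longer does" case, since only the revision current on the cutoff date is consulted, not today's text.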
Thanks for any pointers!
John
1 - http://www.mit.edu/~jrf/knowledge-base-acceleration/
-- ___________________________ John R. Frank jrf@mit.edu