Hello Pywikipedians,
As part of the TREC Knowledge Base Acceleration [1] evaluation for 2012, I
want to generate snapshots of Wikipedia's link graph around certain pages
as of a given date. Does anyone have advice about this? Is anyone doing
something similar that I could learn from?
For example, given a wikipedia.Page() instance for
urlname=Takashi_Murakami, call it the "central node," I want to iterate
through the Pages returned by getReferences() and keep only those that
linked to my central node on a date in the past, such as November 30, 2011.
Is the best (only?) way to do this to iterate through previous revisions
of those Pages and verify that they linked to the central node before that
date? And what about pages that did link to it then but no longer do?
(Probably rare?)
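For what it's worth, one way the revision-walking approach might be sketched is below, using the modern pywikibot API (the successor to the old `wikipedia` module) rather than the 2012-era library. The `links_to()` helper is a hypothetical, simplified wikitext check I wrote for illustration; it only looks at raw `[[...]]` links and ignores templates, redirects, and namespace prefixes. The pywikibot calls (`Site`, `Page`, `getReferences`, `revisions` with `starttime`) assume network access to Wikipedia and may differ across library versions.

```python
import re


def links_to(wikitext, target):
    """Return True if `wikitext` contains a [[wikilink]] to `target`.

    A simplified check for illustration: it normalizes underscores to
    spaces, handles piped links like [[Takashi_Murakami|the artist]] and
    section links like [[Takashi Murakami#Works]], and treats the first
    letter as case-insensitive, as MediaWiki titles do. It does NOT
    resolve redirects or links produced by templates.
    """
    target = target.replace("_", " ").strip()
    # Capture the link target: everything after "[[" up to "|", "#" or "]]".
    for m in re.finditer(r"\[\[\s*([^|\]#]+)", wikitext):
        t = m.group(1).replace("_", " ").strip()
        if t and t[0].upper() + t[1:] == target[0].upper() + target[1:]:
            return True
    return False


if __name__ == "__main__":
    # Network-dependent sketch; requires a configured pywikibot install.
    import pywikibot

    site = pywikibot.Site("en", "wikipedia")
    central = pywikibot.Page(site, "Takashi Murakami")
    cutoff = pywikibot.Timestamp(2011, 11, 30)

    for page in central.getReferences():
        # Newest revision at or before the cutoff (revisions() enumerates
        # newest-first starting from `starttime`).
        revs = list(page.revisions(starttime=cutoff, total=1, content=True))
        if revs and links_to(revs[0].text, central.title()):
            print(page.title())
```

Note this still only walks the *current* backlinks, so it shares the limitation raised above: a page that linked to the central node in November 2011 but has since dropped the link would never appear in getReferences() and would be missed entirely; catching those seems to require a full dump or an old database snapshot.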
Thanks for any pointers!
John
[1] http://www.mit.edu/~jrf/knowledge-base-acceleration/
--
___________________________
John R. Frank <jrf(a)mit.edu>