Miranche wrote:
Greetings Wikitechies,
I'm working on a research project on Wikipedia, and I'd like to create or obtain a historical snapshot of Wikipedia on or about a given date. I'm familiar enough with MediaWiki that I could hack my own script recreating the contents of the "cur" table from the corresponding history (e.g. by calling getRevisionText() a couple of 10^5 times). However, since I'd hate to reinvent the wheel, I'd appreciate it if you could let me know if this has been done before, if there are archives of old snapshots, or if there's an easier way to approach it technically.
I seem to remember someone looking into this, but I don't know if they completed it.
There are a few difficulties which mean you can't produce a completely accurate past snapshot from a recent dump, in particular:
* Pages which have been renamed may appear at different locations; it's difficult to track back to what the title was at a given past time.
* Pages which have been since deleted will not appear at all.
* In rarer cases, histories have been merged after being accidentally separated during editing, or individual revisions have been removed due to copyright or other legalish issues.
* Image file uploads suffer similarly; there are no renames there but older versions may be deleted (usually for vandalism).
You can however extract a reasonable approximation of the contents of the wiki at a given time, depending on your needs.
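As a rough illustration of that approximation, here is a minimal Python sketch that picks, for each page, the latest revision made at or before a cutoff. The Revision record and the snapshot_at() function are hypothetical names made up for this example, not MediaWiki's schema or API; the only assumption borrowed from MediaWiki is the 14-digit YYYYMMDDHHMMSS timestamp string, which sorts correctly as plain text. It inherits every caveat above: titles are whatever the dump records now, and deleted pages or revisions are simply missing.

```python
from typing import Dict, Iterable, NamedTuple


class Revision(NamedTuple):
    # Hypothetical record for illustration; not MediaWiki's actual schema.
    page_id: int
    title: str       # title as recorded in the dump, i.e. the current one
    timestamp: str   # 14-digit UTC string, YYYYMMDDHHMMSS
    text: str


def snapshot_at(revisions: Iterable[Revision], cutoff: str) -> Dict[int, Revision]:
    """Return, for each page, its latest revision at or before `cutoff`."""
    latest: Dict[int, Revision] = {}
    for rev in revisions:
        if rev.timestamp > cutoff:
            continue  # revision was made after the snapshot date
        best = latest.get(rev.page_id)
        if best is None or rev.timestamp > best.timestamp:
            latest[rev.page_id] = rev
    return latest


# Made-up sample history, just to exercise the function.
history = [
    Revision(1, "Apple", "20040101120000", "first draft"),
    Revision(1, "Apple", "20040301120000", "expanded"),
    Revision(1, "Apple", "20050101120000", "rewritten"),
    Revision(2, "Pear", "20041201000000", "created later"),
]

snap = snapshot_at(history, "20040601000000")
for rev in snap.values():
    print(rev.title, rev.timestamp, rev.text)
```

Pages whose first revision is later than the cutoff (page 2 in the sample) drop out of the result, which matches what a snapshot on that date would have contained.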
-- brion vibber (brion @ pobox.com)