On Tue, Dec 14, 2010 at 7:54 AM, Tim Starling <tstarling@wikimedia.org> wrote:
I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001!
This is exciting, because there is lots of article history in here which was assumed to be lost forever.
I've long been interested in Wikipedia's history, and I've tried in the past to locate such backups. I asked various people who might have had one. I had given up hope.
The history of particularly old Wikipedia articles, as seen in the present Wikipedia database, is incomplete, due to UseMod's policy of deleting old revisions of pages after about a month. The script which Brion wrote to import the article histories from UseMod to MediaWiki only fetched those revisions which hadn't been purged yet.
I didn't want to believe that those revisions had been lost forever, and I even opened the UseMod source code and stared forlornly at the unlink() call. What I (and Brion before me) missed is that UseMod appends a record of every change made to two files, called diff_log and rclog. In these two files is a record of every change made to Wikipedia from January 15 to August 17, 2001.
I've put the two log files up on the web, at:
http://noc.wikimedia.org/~tstarling/wikipedia-logs-2001-08-17.7z
The 7-zip archive is only 8.4MB -- much more manageable than today's backups.
rclog contains IP addresses. The UseMod software made IP addresses of logged-in users public, so the people who made these edits had no expectation that their IP addresses would be kept private. That, coupled with the passage of time, makes me think that no harm to user privacy can come from releasing these files.
-- Tim Starling
AWESOME. This is so cool. I've copied the research list too, since there are many Wikipedia historians who will be eager to see the older versions.
I hope we can get them up in a browsable way, like nostalgia.wikipedia.org!
-- phoebe