[Foundation-l] Old Wikipedia backups discovered

Tim Starling tstarling at wikimedia.org
Tue Dec 14 15:54:04 UTC 2010


I was looking through some old files in our SourceForge project. I
opened a file called wiki.tar.gz, and inside were three complete
backups of the text of Wikipedia, from February, March and August 2001!

This is exciting, because there is a lot of article history in here
which was assumed to be lost forever.

I've long been interested in Wikipedia's history, and I've tried in
the past to locate such backups. I asked various people who might have
had one. I had given up hope.

The history of particularly old Wikipedia articles, as seen in the
present Wikipedia database, is incomplete, due to UseMod's policy of
deleting old revisions of pages after about a month. The script which
Brion wrote to import the article histories from UseMod to MediaWiki
only fetched those revisions which hadn't been purged yet.

I didn't want to believe that those revisions had been lost forever,
and I even opened the UseMod source code and stared forlornly at the
unlink() call. What I (and Brion before me) missed is that UseMod appends
a record of every change made to two files, called diff_log and rclog.
In these two files is a record of every change made to Wikipedia from
January 15 to August 17, 2001.

I've put the two log files up on the web, at:

http://noc.wikimedia.org/~tstarling/wikipedia-logs-2001-08-17.7z

The 7-zip archive is only 8.4MB -- much more manageable than today's
backups.

rclog contains IP addresses. The UseMod software made IP addresses of
logged-in users public, so the people who made these edits had no
expectation that their IP address would be kept private. That, coupled
with the passage of time, makes me think that no harm to user privacy
can come from releasing these files.

-- Tim Starling



