On 17/05/11 01:46, lior gimel wrote:
Hi,
Following the discovery of the Wikipedia backups from 2001, and with
much appreciated help from Joseph Reagle, I created Excel files from
the diff_log and rclog files that together include all the edits
from January 15th to August 17th (the last in the backups). The
files include the time of edit and the editor's username and IP.
While they are a bit buggy, and could use some more editing (or
preferably, conversion to wiki format), they are simple, easy to
work with (using searches and filters), and full of interesting stuff.
If anyone is interested, I'd be happy to share them.
The problem with using diff_log and rclog alone is that they are
missing a lot of changes. UseMod allowed administrators to do global
search and replace operations, page renames and deletions, and these
weren't represented in the logs.
Back in December, I reconstructed a lot of these admin operations by
hand, using other data from the UseMod backup to help. I committed my
findings to MediaWiki Subversion in the form of a script, and uploaded
the output as a MediaWiki-format XML dump:
http://noc.wikimedia.org/~tstarling/wikipedia-2001-08-xml.7z
See my mailing list post:
http://article.gmane.org/gmane.science.linguistics.wikipedia.english/107675
Here is the UseMod backup, so you can repeat or extend this work if
you feel like it:
http://noc.wikimedia.org/~tstarling/wikipedia-2001-08-backup.7z
I edited out the admin password, since despite it being a very silly
and obvious password, it's not generally known and somebody might
still be using it for something. There were some tar files inside the
main archive; I expanded them to make it easier to edit the passwords
out of them. Then I recompressed the whole thing with 7zip.
-- Tim Starling