[Foundation-l] Old Wikipedia backups discovered

Jay Walsh jwalsh at wikimedia.org
Tue Dec 14 18:09:39 UTC 2010


This is definitely a tremendous asset leading up to our big bday in January. I hope we can extract and post some of the real gems.  

Thanks for the resourcefulness and the sharing, Tim.

On Dec 14, 2010, at 10:04 AM, phoebe ayers wrote:

> On Tue, Dec 14, 2010 at 7:54 AM, Tim Starling <tstarling at wikimedia.org> wrote:
>> I was looking through some old files in our SourceForge project. I
>> opened a file called wiki.tar.gz, and inside were three complete
>> backups of the text of Wikipedia, from February, March and August 2001!
>> 
>> This is exciting, because there is lots of article history in here
>> which was assumed to be lost forever.
>> 
>> I've long been interested in Wikipedia's history, and I've tried in
>> the past to locate such backups. I asked various people who might have
>> had one. I had given up hope.
>> 
>> The history of particularly old Wikipedia articles, as seen in the
>> present Wikipedia database, is incomplete, due to UseMod's policy of
>> deleting old revisions of pages after about a month. The script which
>> Brion wrote to import the article histories from UseMod to MediaWiki
>> only fetched those revisions which hadn't been purged yet.
>> 
>> I didn't want to believe that those revisions had been lost forever,
>> and I even opened the UseMod source code and stared forlornly at the
>> unlink() call. What I (and Brion before) missed is that UseMod appends
>> a record of every change made to two files, called diff_log and rclog.
>> In these two files is a record of every change made to Wikipedia from
>> January 15 to August 17, 2001.
>> 
>> I've put the two log files up on the web, at:
>> 
>> http://noc.wikimedia.org/~tstarling/wikipedia-logs-2001-08-17.7z
>> 
>> The 7-zip archive is only 8.4MB -- much more manageable than today's
>> backups.
>> 
>> rclog contains IP addresses. The UseMod software made IP addresses of
>> logged-in users public, so the people who made these edits had no
>> expectation that their IP address would be kept private. That, coupled
>> with the passage of time, makes me think that no harm to user privacy
>> can come from releasing these files.
>> 
>> -- Tim Starling
> 
> AWESOME. This is so cool. I've copied the research list too, since
> there are many Wikipedia historians who will be eager to see the older
> versions.
> 
> I hope we can get them up in a browsable way, like nostalgia.wikipedia.org!
> 
> -- phoebe
> 
> _______________________________________________
> foundation-l mailing list
> foundation-l at lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l

-- 
Jay Walsh
Head of Communications
WikimediaFoundation.org
blog.wikimedia.org
+1 (415) 839 6885 x 609, @jansonw



