Hi Susan,
Do you need the entire database diff'd, i.e. all edits ever, or are you interested in
a particular subset of the diffs? It would help to know your purpose.
For instance, I am interested in diffs around specific articles for specific dates, to
study news events. So I calculate the diffs myself using Python on page histories rather
than processing the entire database.
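Roughly, it looks like this (a minimal sketch, not production code; the article title
and date range are placeholders you'd substitute with your own):

    # Pull the revision history for one article over a date range from the
    # public MediaWiki API and diff consecutive revisions with difflib.
    import difflib
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    params = {
        "action": "query",
        "prop": "revisions",
        "titles": "Hurricane Sandy",        # placeholder article
        "rvprop": "ids|timestamp|content",
        "rvdir": "newer",                    # oldest first, so diffs read forward
        "rvstart": "2012-10-25T00:00:00Z",   # placeholder date range
        "rvend": "2012-11-05T00:00:00Z",
        "rvlimit": 50,                       # the API caps content queries at 50/request
        "format": "json",
    }

    resp = requests.get(API, params=params).json()
    page = next(iter(resp["query"]["pages"].values()))
    revs = page.get("revisions", [])

    for old, new in zip(revs, revs[1:]):
        # revision text lives under the '*' key in the default JSON format
        diff = difflib.unified_diff(
            old["*"].splitlines(), new["*"].splitlines(),
            fromfile=str(old["revid"]), tofile=str(new["revid"]), lineterm="")
        print("\n".join(diff))

For longer ranges you follow the query-continue parameter in the response, but even so
this stays far cheaper than diffing the full dumps.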
Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023
________________________________
From: wiki-research-l-bounces(a)lists.wikimedia.org
<wiki-research-l-bounces(a)lists.wikimedia.org> on behalf of Susan Biancani
<inacnaib(a)gmail.com>
Sent: Thursday, October 03, 2013 10:06 PM
To: wiki-research-l(a)lists.wikimedia.org
Subject: [Wiki-research-l] diffdb formatted Wikipedia dump
I'm looking for a dump from English Wikipedia in diff format (i.e. each entry is the
text that was added or deleted since the last edit, rather than the current state of
the page).
The Summer of Research folks provided a handy guide to creating such a dataset from
the standard complete dumps here:
http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
But the time estimate they give is prohibitive for me: 20-24 hours per dump file,
running on 24 cores, and there are currently 158 files. I'm a grad student in a social
science department and don't have access to extensive computing power. I've been
paying out of pocket for AWS, but this would get expensive.
There is a diff-format dataset available, but it only runs through April 2011 (here:
http://dumps.wikimedia.org/other/diffdb/). I'd like a diff-format dataset covering
January 2010 through March 2013 (or everything up to March 2013).
Does anyone know if such a dataset exists somewhere? Any leads or suggestions would be
much appreciated!
Susan