I'm looking for a dump of English Wikipedia in diff format (i.e. each entry is the text that was added or deleted since the previous edit, rather than each entry being the full current text of the page).
The Summer of Research folks provided a handy guide to creating such a dataset from the standard complete dumps here:
http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
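To illustrate what I mean by "diff format" and the kind of processing involved, here is a rough sketch in Python using only the standard library. This is not the WSoR tooling itself, just the basic idea: walk a pages-meta-history XML file, keep the previous revision's text for each page, and diff consecutive revisions. The file name is a placeholder.

    import bz2
    import difflib
    import xml.etree.ElementTree as ET

    # Placeholder path: one of the pages-meta-history dump files (assumption).
    DUMP_PATH = "enwiki-pages-meta-history1.xml.bz2"

    def local(tag):
        # Strip the MediaWiki export namespace so tag names are comparable.
        return tag.rsplit("}", 1)[-1]

    def revision_diffs(path):
        """Yield (page_title, revision_id, diff_lines) for consecutive revisions."""
        prev_text = None
        title = None
        with bz2.open(path, "rb") as f:
            for event, elem in ET.iterparse(f, events=("end",)):
                tag = local(elem.tag)
                if tag == "title":
                    title = elem.text
                    prev_text = None           # new page: reset the comparison base
                elif tag == "revision":
                    rev_id = None
                    text = ""
                    for child in elem:
                        if local(child.tag) == "id" and rev_id is None:
                            rev_id = child.text
                        elif local(child.tag) == "text":
                            text = child.text or ""
                    # Unified diff of the previous revision against this one.
                    diff = list(difflib.unified_diff(
                        (prev_text or "").splitlines(),
                        text.splitlines(),
                        lineterm=""))
                    yield title, rev_id, diff
                    prev_text = text
                    elem.clear()               # free memory as we go
                elif tag == "page":
                    elem.clear()

    if __name__ == "__main__":
        for title, rev_id, diff in revision_diffs(DUMP_PATH):
            added = sum(1 for l in diff if l.startswith("+") and not l.startswith("+++"))
            removed = sum(1 for l in diff if l.startswith("-") and not l.startswith("---"))
            print(f"{title}\t{rev_id}\t+{added}\t-{removed}")

Even a naive version like this shows why the job is so heavy: every revision's full text has to be decompressed and diffed against its predecessor.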
But the time estimate they give is prohibitive for me: 20-24 hours per dump file, running on 24 cores, and there are currently 158 files. I'm a
grad student in a social science department, and don't have access to
extensive computing power. I've been paying out of pocket for AWS, but
this would get expensive.
There is a diff-format dataset available, but only through April 2011 (here:
http://dumps.wikimedia.org/other/diffdb/). I'd like a diff-format dataset covering January 2010 through March 2013 (or everything up to March 2013).