I'm looking for a dump of English Wikipedia in diff format (i.e., each
entry is the text that was added or deleted since the previous edit, rather
than the full current state of the page).
The Summer of Research folks provided a handy guide to creating such a
dataset from the standard complete dumps here:
http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
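(For anyone curious, the core step is just diffing each revision against the one before it. Here's a toy sketch in Python's standard difflib, using a hypothetical `revision_diff` helper and made-up revision text; it is not the WSoR pipeline itself, just an illustration of the idea.)

```python
import difflib

def revision_diff(old_text, new_text):
    """Return (added, removed) lines between two revisions of a page."""
    added, removed = [], []
    for line in difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm=""
    ):
        # Skip the "---"/"+++" file headers; keep real +/- content lines.
        if line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:])
        elif line.startswith("-") and not line.startswith("---"):
            removed.append(line[1:])
    return added, removed

# Two made-up consecutive revisions of a page:
rev1 = "Alpha\nBeta\n"
rev2 = "Alpha\nGamma\n"
added, removed = revision_diff(rev1, rev2)
print(added)    # ['Gamma']
print(removed)  # ['Beta']
```

The expense in the real pipeline comes from doing this across every revision of every page in the full history dumps, which is why the time estimates below are so large.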
But the time estimate they give is prohibitive for me: 20-24 hours per
dump file (there are currently 158), running on 24 cores. I'm a grad
student in a social science department and don't have access to extensive
computing power. I've been paying out of pocket for AWS, but at that scale
it would get expensive.
There is a diff-format dataset available, but only through April 2011
(here: http://dumps.wikimedia.org/other/diffdb/). I'd like a diff-format
dataset covering January 2010 through March 2013 (or everything up to
March 2013).
Does anyone know if such a dataset exists somewhere? Any leads or
suggestions would be much appreciated!
Susan