From: Susan Biancani <inacnaib@gmail.com>
Subject: [Wiki-research-l] diffdb formatted Wikipedia dump
Date: October 3, 2013 10:06:44 PM PDT
Reply-To: Research into Wikimedia content and communities <wiki-research-l@lists.wikimedia.org>

I'm looking for a dump of English Wikipedia in diff format (i.e. each entry is the text that was added or deleted since the previous edit, rather than the full state of the page after that edit).
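To make "diff format" concrete, here is a minimal sketch of what one entry would hold for a pair of consecutive revisions. It uses Python's standard difflib at line granularity; the function name revision_diff and the line-level granularity are assumptions for illustration, not necessarily what the diffdb tooling does.

```python
import difflib


def revision_diff(old_text, new_text):
    """Return (added, deleted) lines between two revisions of a page.

    Illustrative only: compares line by line with difflib.ndiff.
    """
    added, deleted = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            deleted.append(line[2:])
        # Lines starting with "  " are unchanged and "? " are hints; skip both.
    return added, deleted


if __name__ == "__main__":
    previous = "Intro\nOld sentence\nFooter"
    current = "Intro\nFooter\nNew sentence"
    added, deleted = revision_diff(previous, current)
    print("added:", added)
    print("deleted:", deleted)
```

A diff-format dump stores only the added and deleted parts per edit, which is why it has to be computed from consecutive full revisions in the standard dumps.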

The Summer of Research folks provided a handy guide to how to create such a dataset from the standard complete dumps here: http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
But the time estimate they give is prohibitive for me: 20-24 hours for each dump file, running on 24 cores, and there are currently 158 dump files. I'm a grad student in a social science department and don't have access to extensive computing power. I've been paying out of pocket for AWS, but this would get expensive.
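As a rough back-of-the-envelope check, the figures above can be multiplied out. This is only a sketch: it assumes the files are processed one after another on a single 24-core machine, and hourly_rate is a hypothetical placeholder to be filled in with whatever the chosen instance actually costs.

```python
def total_machine_hours(num_files, hours_per_file):
    """Total hours on one 24-core machine, processing files sequentially."""
    return num_files * hours_per_file


def estimated_cost(machine_hours, hourly_rate):
    """Cost for a given number of machine hours at a hypothetical hourly rate."""
    return machine_hours * hourly_rate


NUM_FILES = 158          # dump files, from the estimate above
HOURS_LOW, HOURS_HIGH = 20, 24  # hours per file on 24 cores

low = total_machine_hours(NUM_FILES, HOURS_LOW)    # 3160 hours
high = total_machine_hours(NUM_FILES, HOURS_HIGH)  # 3792 hours

if __name__ == "__main__":
    print(f"{low}-{high} machine hours, about {low / 24:.0f}-{high / 24:.0f} days")
```

That comes to roughly 3,160 to 3,792 hours of a 24-core machine, or about 132 to 158 days if run back to back.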

There is a diff-format dataset available, but only through April 2011 (here: http://dumps.wikimedia.org/other/diffdb/). I'd like to get a diff-format dataset for January 2010 to March 2013 (or for everything up to March 2013).

Does anyone know if such a dataset exists somewhere? Any leads or suggestions would be much appreciated!

Hi Susan,

There is no newer version of the dataset than the one you have found; that's the bad news. The good news is that the dataset was used with really slow commodity hardware. What you could do is run it on AWS using a smaller dataset, for example the Dutch Wikipedia, and see how long it takes. An alternative solution would be to start thinking (with other researchers and Wikimedia community members) about having a small Hadoop cluster in Labs with only public data. That way you don't need to pay, but obviously it will be less performant. The Analytics team has Puppet manifests ready that will build an entire Hadoop cluster.

The wikimedia-analytics mailing list is a good place for such a conversation, or if you need more hands-on help with the diffdb, please come to IRC: wikimedia-analytics.

Best,
Diederik