Hey folks,

I've been working on building up a revision diffs service that you'd be able to listen to live or download as a dump.

See https://github.com/halfak/Difference-Engine for my progress on the live system and https://github.com/halfak/MediaWiki-Streaming for my progress developing a Hadoop Streaming primer to generate old diffs[1].  See also https://github.com/halfak/Deltas for some experimental diff algorithms developed specifically to track content moves in Wikipedia revisions. 

In the short term, I can share diff datasets.  In the near-term, I'm wondering if you folks would be interested in working on the project with me.  If so, let me know and I'll give you a more complete status update. 

1. It turns out that generating diffs is computationally expensive, so generating them in real time is slow and lame.  I'm working to generate all historical diffs using Hadoop and then have a live system listening to recent changes to keep the data up-to-date[2].
2. https://github.com/halfak/MediaWiki-events
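To give a rough idea of what a per-revision diff record might look like, here's a minimal sketch using Python's stdlib difflib on word tokens. This is not the Deltas algorithms linked above (those are move-aware); it's just an illustration of the kind of output involved.

```python
import difflib

def revision_diff(old_text, new_text):
    """Compute non-equal operations between two revision texts.

    A rough sketch using difflib.SequenceMatcher over word tokens;
    the Deltas library linked above uses different, move-aware
    algorithms.
    """
    old_tokens = old_text.split()
    new_tokens = new_text.split()
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens)
    return [
        (op, " ".join(old_tokens[i1:i2]), " ".join(new_tokens[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

# Example: one word replaced between two revisions.
old = "The quick brown fox jumps over the lazy dog"
new = "The quick red fox jumps over the lazy dog"
print(revision_diff(old, new))  # [('replace', 'brown', 'red')]
```

Even this toy version is quadratic in the worst case, which is part of why doing it in real time across all of recent changes is costly.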

-Aaron

On Sat, Dec 13, 2014 at 9:16 AM, Ed Summers <ehs@pobox.com> wrote:
+1 Yuvi

About a year ago I put together a little program that identified .uk external links in Wikipedia’s changes for the web archiving folks at the British Library. Because it needed to fetch the diff for each change, I never pushed it very far out of concern for the API traffic. I never asked though, so good on Max for bringing it up.
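The core of that kind of filter is small; here's a hedged sketch (the helper name and the regex are mine, not the original program's), pulling .uk links out of the text a diff added:

```python
import re

# Matches http(s) URLs whose host ends in .uk. This is a
# simplification; a real external-link extractor would need to
# handle more URL edge cases than this.
UK_LINK_RE = re.compile(r"https?://[^\s/]+\.uk\b[^\s\]|}]*", re.IGNORECASE)

def uk_links(added_text):
    """Return .uk external links found in the text added by a change."""
    return UK_LINK_RE.findall(added_text)

sample = "Added a source: [http://www.bl.uk/collections UK Web Archive] and http://example.org/"
print(uk_links(sample))  # ['http://www.bl.uk/collections']
```

The expensive part was never the matching; it was fetching a diff per change to get `added_text` in the first place.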

Rather than setting up an additional stream endpoint I wonder if it might be feasible to add a query parameter to the existing one? So, something like:

    http://stream.wikimedia.org/rc?diff=true

//Ed

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l