Hey folks,
I've been working on a revision diffs service that would let you listen to
a live feed of revision diffs or download a dump of them.
See
https://github.com/halfak/Difference-Engine for my progress on the live
system and
https://github.com/halfak/MediaWiki-Streaming for my progress
developing a Hadoop Streaming primer to generate old diffs[1]. See also
https://github.com/halfak/Deltas for some experimental diff algorithms
developed specifically to track content moves in Wikipedia revisions.
In the short term, I can share diff datasets. In the near term, I'm
wondering if you folks would be interested in working on the project with
me. If so, let me know and I'll give you a more complete status update.
1. It turns out that generating diffs is computationally expensive, so
generating them in real time is slow and lame. I'm working to generate all
diffs historically using Hadoop and then have a live system listening to
recent changes to keep the data up-to-date[2].
2. https://github.com/halfak/MediaWiki-events
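To give a sense of what diff generation involves, here's a minimal sketch using Python's standard-library difflib. (The Deltas project implements its own algorithms to track content moves; this generic token-level diff is just an illustration, and the function name is mine.)

```python
import difflib

def diff_ops(old_text, new_text):
    """Return (tag, old_segment, new_segment) tuples describing the edit
    operations that transform old_text into new_text, token by token."""
    old_tokens = old_text.split()
    new_tokens = new_text.split()
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens)
    return [
        (tag, " ".join(old_tokens[i1:i2]), " ".join(new_tokens[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the parts that actually changed
    ]

# A tiny "revision edit": one word replaced, one appended
ops = diff_ops("the quick brown fox", "the quick red fox jumps")
```

Running this over every pair of adjacent revisions in a page's history is what makes the problem expensive at Wikipedia scale, hence the Hadoop batch job plus a live updater.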
-Aaron
On Sat, Dec 13, 2014 at 9:16 AM, Ed Summers <ehs(a)pobox.com> wrote:
+1 Yuvi
About a year ago I put together a little program that identified .uk
external links in Wikipedia’s changes for the web archiving folks at the
British Library. Because it needed to fetch the diff for each change I
never pushed it very far, out of concern for the API traffic. I never
asked though, so good on Max for bringing it up.
Rather than setting up an additional stream endpoint, I wonder if it might
be feasible to add a query parameter to the existing one? So, something
like:
http://stream.wikimedia.org/rc?diff=true
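If the stream did carry the diff inline, a consumer like the .uk link-harvester above could avoid the per-change API fetch entirely. A sketch, assuming a hypothetical event shape where each recent-changes event carries its diff text under a `diff` key (both the `diff=true` parameter and that field are proposals, not existing API):

```python
import json
import re

def extract_uk_links(event_json):
    """Pull .uk external links out of a (hypothetical) recent-changes
    event that carries its own diff text under a 'diff' key."""
    event = json.loads(event_json)
    diff_text = event.get("diff", "")
    # Rough pattern for http(s) URLs on a .uk host; illustration only.
    return re.findall(r'https?://[\w.-]+\.uk\b[^\s"<>]*', diff_text)

# A made-up event of the shape a ?diff=true stream might emit
sample = json.dumps({
    "title": "Example",
    "diff": '+ See <a href="http://www.bl.uk/collections">the BL</a>',
})
```

The appeal of the query-parameter approach is that existing consumers of /rc keep working unchanged, while diff-hungry ones opt in.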
//Ed
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l