I'd be interested in helping if we could generalise it!
You can probably get a substantial speed improvement in C or C++, and C and
C++ code can be bound to both Python and R, our primary working languages
for analytics. R also lacks any kind of text diffing engine, so I've been
actively looking into how to build one.
So if we switch languages for the performance win and generalisability, I'm
in ;).
On 13 December 2014 at 12:33, Aaron Halfaker <ahalfaker(a)wikimedia.org>
wrote:
Hey folks,
I've been working on building up a revision diffs service that you'd be
able to listen to, or from which you could download a dump of revision
diffs.
See https://github.com/halfak/Difference-Engine for my progress on the
live system and https://github.com/halfak/MediaWiki-Streaming for my
progress developing a Hadoop Streaming primer to generate old diffs[1].
See also https://github.com/halfak/Deltas for some experimental diff
algorithms developed specifically to track content moves in Wikipedia
revisions.
In the short term, I can share diff datasets. In the near-term, I'm
wondering if you folks would be interested in working on the project with
me. If so, let me know and I'll give you a more complete status update.
1. It turns out that generating diffs is computationally complex, so
generating them in real time is slow and lame. I'm working to generate all
diffs historically using Hadoop and then have a live system listening to
recent changes to keep the data up-to-date[2].
2. https://github.com/halfak/MediaWiki-events
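
To give a feel for what footnote 1 describes, here is a minimal Python sketch of a token-level diff between two revision texts, using only the standard library's difflib. It is not the algorithm used in Difference-Engine or Deltas; the function name `token_diff` and the whitespace tokenisation are assumptions made purely for illustration.

```python
import difflib


def token_diff(old_text, new_text):
    """Return (operation, old_tokens, new_tokens) for each changed region.

    Hypothetical helper: tokens are whitespace-separated words, which is
    much cruder than what a real wikitext differ would use.
    """
    old_tokens = old_text.split()
    new_tokens = new_text.split()
    # autojunk=False disables difflib's "popular element" heuristic so that
    # frequent tokens are not silently ignored on long revisions.
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, old_tokens[i1:i2], new_tokens[j1:j2]))
    return ops


if __name__ == "__main__":
    print(token_diff("the quick brown fox", "the quick red fox"))
```

The difflib documentation describes SequenceMatcher as quadratic time in the worst case, which fits the point above: running something like this for every edit as it arrives gets expensive quickly on long pages.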
-Aaron
On Sat, Dec 13, 2014 at 9:16 AM, Ed Summers <ehs(a)pobox.com> wrote:
+1 Yuvi
About a year ago I put together a little program that identified .uk
external links in Wikipedia’s changes for the web archiving folks at the
British Library. Because it needed to fetch the diff for each change I
never pushed it very far, out of concerns for the API traffic. I never
asked though, so good on Max for bringing it up.
Rather than setting up an additional stream endpoint I wonder if it might
be feasible to add a query parameter to the existing one? So, something
like:
http://stream.wikimedia.org/rc?diff=true
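
As a rough sketch of what that could look like from a client's side, the Python below adds the parameter to a stream URL. Note that `diff=true` is only the proposal above, not an existing option of the stream, and `with_diff_param` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_diff_param(url, enabled=True):
    """Return url with a diff=true/false query parameter set.

    The 'diff' parameter is the suggestion from this thread, assumed here
    for illustration; the service is not known to support it.
    """
    parts = urlsplit(url)
    # Keep any query parameters already present on the URL.
    query = dict(parse_qsl(parts.query))
    query["diff"] = "true" if enabled else "false"
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(with_diff_param("http://stream.wikimedia.org/rc"))
```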
//Ed
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l