Hi!
What more do you have in mind that could go into an "augmented stream" beyond
the current RCstream data plus diffs as they are provided by the API?
Mitar
On Mon, Dec 15, 2014 at 10:22 PM, Maximilian Klein <isalix(a)gmail.com> wrote:
All,
Thanks for the great responses. It seems like Andrew, Ed, DataSift, and
Mitar are now all offering overlapping solutions to the real-time diff
monitoring problem. The one thing I take away from that is that if the API
is robust enough to serve these 4 clients in real time, then adding another
is a drop in the bucket.
However, as others like Yuvi pointed out and Aaron has prototyped, we could
make this better by serving an augmented RCstream. I wonder how easy it
would be to allow community development on that project, since it seems it
would require access to the full databases, which only WMF developers
appear to have at the moment.
Make a great day,
Max Klein ‽
http://notconfusing.com/
On Mon, Dec 15, 2014 at 5:09 AM, Flöck, Fabian <Fabian.Floeck(a)gesis.org>
wrote:
If anyone is interested in faster processing of revision differences,
you could also adapt the strategy we implemented for wikiwho [1], which is
to keep track of larger unchanged text chunks via hashes and diff only
the remaining text (usually a relatively small part of the article). We
introduced that technique specifically because diffing all the text was too
expensive. In principle it can produce the same output, although we
currently use it for authorship detection, which is a slightly different
task. Anyway, it is on average >100 times faster than pure "traditional"
diffing. Maybe that is useful for someone. Code is available on GitHub [2].
[1]
http://f-squared.org/wikiwho
[2]
https://github.com/maribelacosta/wikiwho
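For anyone curious how the hash-based shortcut works in practice, here is a minimal Python sketch of the idea: hash fixed-size blocks of lines, skip blocks whose hashes match, and run a real diff only on the blocks that changed. The function names and the fixed-size blocking are my own simplification for illustration; this is not the wikiwho implementation, which tracks chunks in a more sophisticated way.

```python
import difflib
import hashlib
import itertools

def chunk_hashes(text, size=20):
    """Split text into fixed-size blocks of lines and hash each block."""
    lines = text.splitlines()
    blocks = [lines[i:i + size] for i in range(0, len(lines), size)]
    return [(hashlib.sha1("\n".join(b).encode("utf-8")).hexdigest(), b)
            for b in blocks]

def fast_diff(old_text, new_text, size=20):
    """Diff only the blocks whose hashes changed; identical blocks are
    skipped entirely, which is where the speedup comes from."""
    pairs = itertools.zip_longest(chunk_hashes(old_text, size),
                                  chunk_hashes(new_text, size),
                                  fillvalue=(None, []))
    ops = []
    for (old_h, old_b), (new_h, new_b) in pairs:
        if old_h == new_h:
            continue  # unchanged block: no expensive diff needed
        ops.extend(difflib.unified_diff(old_b, new_b, lineterm=""))
    return ops
```

One caveat of this naive fixed-position blocking: a single inserted line shifts every later block boundary and invalidates their hashes, so it only pays off when edits leave most block boundaries intact. Tracking chunks by content rather than position, as wikiwho does, avoids that problem.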
On 14.12.2014, at 07:23, Jeremy Baron <jeremy(a)tuxmachine.com> wrote:
On Dec 13, 2014 12:33 PM, "Aaron Halfaker" <ahalfaker(a)wikimedia.org>
wrote:
1. It turns out that generating diffs is
computationally complex, so
generating them in real time is slow and lame. I'm working to generate all
diffs historically using Hadoop and then have a live system listening to
recent changes to keep the data up-to-date[2].
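As a rough illustration of the "live system" half of that pipeline: each recent-changes event carries an old and a new revision ID, and MediaWiki's action=compare endpoint will return the server-rendered diff for such a pair. The sketch below is my own illustration of one way to wire that up, not Aaron's actual system; the response field layout assumes the legacy format=json output, where the diff HTML sits under compare["*"].

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def compare_params(fromrev, torev):
    """Parameters for MediaWiki's action=compare, which returns the
    server-rendered diff between two revision IDs."""
    return {"action": "compare", "fromrev": fromrev,
            "torev": torev, "format": "json"}

def fetch_diff(fromrev, torev):
    """Fetch the HTML diff for one recent change (makes a network call)."""
    with urlopen(API + "?" + urlencode(compare_params(fromrev, torev))) as r:
        return json.load(r)["compare"]["*"]
```

Doing this per event is exactly the "slow and lame" real-time cost being discussed, which is why precomputing the history in Hadoop and only keeping up with new events is attractive.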
IIRC Mako does that in ~4 hours (maybe outdated and takes longer now) for
all enwiki diffs for all time. (don't remember if this is namespace limited)
But also using an extraordinary amount of RAM, i.e. hundreds of GB.
AIUI, there's no dynamic memory allocation; revisions are loaded into
fixed-size buffers larger than the largest revision.
https://github.com/makoshark/wikiq
-Jeremy
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Cheers,
Fabian
--
Fabian Flöck
Research Associate
Computational Social Science department @GESIS
Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Tel: + 49 (0) 221-47694-208
fabian.floeck(a)gesis.org
www.gesis.org
www.facebook.com/gesis.org