On Dec 13, 2014 12:33 PM, "Aaron Halfaker" <ahalfaker(a)wikimedia.org>
wrote:
1. It turns out that generating diffs is computationally complex, so
generating them in real time is slow and lame. I'm working to generate
all diffs historically using Hadoop and then have a live system
listening to recent changes to keep the data up-to-date[2].
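To illustrate why the diffs are expensive: a token-level diff is roughly quadratic in revision length in the worst case, and enwiki has hundreds of millions of revisions. A minimal sketch using Python's stdlib difflib (my illustration, not Aaron's actual Hadoop pipeline):

```python
from difflib import SequenceMatcher

def revision_diff(old: str, new: str):
    """Word-level diff between two revision texts.

    SequenceMatcher is roughly O(n*m) in the worst case, which is why
    diffing every adjacent revision pair across all of enwiki history
    is costly enough to justify a batch pass plus a live updater.
    """
    a, b = old.split(), new.split()
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":  # keep only the actual changes
            ops.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return ops

# One word replaced, one word appended:
print(revision_diff("the quick brown fox", "the slow brown fox jumps"))
# → [('replace', 'quick', 'slow'), ('insert', '', 'jumps')]
```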
IIRC Mako does that in ~4 hours for all enwiki diffs for all time (that
may be outdated and take longer now, and I don't remember if it's
namespace-limited). But it also uses an extraordinary amount of RAM,
i.e. hundreds of GB.
AIUI, there's no dynamic memory allocation: revisions are loaded into
fixed-size buffers larger than the largest revision.
https://github.com/makoshark/wikiq
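A rough sketch of that buffer strategy as I understand it (my own illustration, assumed from the description above rather than taken from wikiq, which is C/C++): allocate one buffer up front, sized above any revision, and reuse it for every read, trading memory for zero per-revision allocation.

```python
import io

# Assumed cap: must exceed the largest revision that will ever be read.
MAX_REVISION_BYTES = 16 * 1024 * 1024

# One fixed-size buffer, allocated once and reused for every revision.
buf = bytearray(MAX_REVISION_BYTES)

def load_revision(stream: io.BufferedIOBase) -> memoryview:
    """Read one revision into the shared buffer.

    readinto() fills the preallocated buffer in place, so no new
    revision-sized allocation happens per call; the returned view is
    only valid until the next load_revision() call.
    """
    n = stream.readinto(buf)
    return memoryview(buf)[:n]

rev = load_revision(io.BytesIO(b"== Heading ==\nSome wikitext."))
print(bytes(rev))
# → b'== Heading ==\nSome wikitext.'
```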
-Jeremy