Hi Tilman, thanks for pointing to this research. We have indeed worked on this kind of project, for both ORES and the WikiCredit system. There are many challenges, memory and processing time among them. Loading the entire history without the text is what we're working on right now for our Wikistats 2.0 project, and even that comes with plenty of challenges.
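For the curious, here is a minimal sketch of what that "without text" streaming looks like (Python 3, standard library only; the dump filename in the example is hypothetical): revision metadata is pulled out of the XML history dump element by element and each finished page is discarded immediately, so memory stays flat regardless of dump size.

    import bz2
    import xml.etree.ElementTree as ET

    def iter_revision_metadata(dump_path):
        """Stream revision ids and timestamps (no text) from a MediaWiki
        XML history dump, keeping memory usage flat."""
        def local(tag):
            # Strip the XML namespace, which changes with the export schema version.
            return tag.rsplit('}', 1)[-1]

        with bz2.open(dump_path, 'rb') as f:
            context = ET.iterparse(f, events=('start', 'end'))
            _, root = next(context)  # grab the <mediawiki> root so we can prune it
            for event, elem in context:
                if event != 'end':
                    continue
                tag = local(elem.tag)
                if tag == 'revision':
                    rev_id, timestamp = None, None
                    for child in elem:
                        ctag = local(child.tag)
                        if ctag == 'id' and rev_id is None:
                            rev_id = child.text
                        elif ctag == 'timestamp':
                            timestamp = child.text
                    yield rev_id, timestamp
                elif tag == 'page':
                    root.clear()  # drop finished pages so the tree never grows

    # e.g.: for rev_id, ts in iter_revision_metadata('enwiki-pages-meta-history1.xml.bz2'): ...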
Detailed technical report on an undergraduate student project at Virginia Tech (work in progress) to import the entire English Wikipedia history dump into the university's Hadoop cluster and index it using Apache Solr, to "allow researchers and developers at Virginia Tech to benchmark configurations and big data analytics software":
Steven Stulga, "English Wikipedia on Hadoop Cluster"
https://vtechworks.lib.vt.edu/handle/10919/70932 (CC BY 3.0)
IIRC this has rarely, if ever, been attempted before, due to the sheer size of the dataset (about 10 TB uncompressed). And it looks like the author ran into an out-of-memory error that he wasn't able to resolve before the end of the term...
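For what it's worth, one common way around that kind of out-of-memory failure is to never materialize the full document set at all and to feed Solr in small batches instead. A rough sketch under assumed names (Python with the requests library; the Solr URL, core name and field names are placeholders, not the project's actual setup), using Solr's JSON update handler:

    import json
    import requests

    # Hypothetical Solr core; adjust the host and core name to the actual cluster.
    SOLR_UPDATE_URL = 'http://localhost:8983/solr/enwiki_history/update'
    BATCH_SIZE = 1000  # documents held in memory at any one time

    def index_in_batches(records):
        """Index an arbitrarily large stream of documents while buffering
        at most BATCH_SIZE of them in memory."""
        batch = []
        for doc in records:
            batch.append(doc)
            if len(batch) >= BATCH_SIZE:
                flush(batch)
                batch = []
        if batch:
            flush(batch)

    def flush(batch):
        # Solr's /update handler accepts a JSON array of documents;
        # commitWithin lets Solr schedule commits instead of committing per request.
        resp = requests.post(
            SOLR_UPDATE_URL,
            params={'commitWithin': '60000'},
            data=json.dumps(batch),
            headers={'Content-Type': 'application/json'},
        )
        resp.raise_for_status()

    # e.g.: index_in_batches({'id': rev_id, 'timestamp': ts}
    #                        for rev_id, ts in iter_revision_metadata(...))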
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB