Susan,
Hmm, it seems like that's a funny middle ground, where it's too long to fetch live from the API - although it's probably less than 158 days. I once read and edited 400,000 pages with pywikibot (3 network IO calls per page: read, external API, write) in about 20 days. You would have to make two IO calls (read, getHistory) per userpage. I don't know how many userpages there are, but those numbers might give you enough variables to satisfy the system of inequalities you need.
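Back of the envelope, that earlier run averaged about 20,000 pages a day at three calls each, so at two calls per userpage you might manage something like 30,000 userpages a day. For what it's worth, here is a rough sketch of what those two calls could look like in pywikibot - treat it as illustrative only, since it assumes a working pywikibot configuration and the username is just a placeholder:

import pywikibot

site = pywikibot.Site('en', 'wikipedia')

def user_talk_history(username):
    """Return the current text and revision history of a user talk page."""
    page = pywikibot.Page(site, 'User talk:' + username)
    text = page.text                  # IO call 1: read the current wikitext
    history = list(page.revisions())  # IO call 2: fetch the revision metadata
    return text, history

# Placeholder username, not a real target:
text, history = user_talk_history('ExampleUser')
print(len(history), 'revisions')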
If you are dead set on using Hadoop, maybe you could use the Wikimedia Labs XGrid:
https://wikitech.wikimedia.org/wiki/Main_Page.
They have some monster computing power, and it's free for bot operators and other tool runners. It might also be worth asking there whether someone already has wikihadoop set up.
Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023
From: wiki-research-l-bounces@lists.wikimedia.org <wiki-research-l-bounces@lists.wikimedia.org> on behalf of Susan Biancani <inacnaib@gmail.com>
Sent: Tuesday, October 08, 2013 3:28 PM
To: Research into Wikimedia content and communities
Subject: Re: [Wiki-research-l] diffdb formatted Wikipedia dump
Right now, I want all the edits to user pages and user talk pages from 2010 to 2013. But as I keep going with this project, I may want to expand a bit, so I figured that if I was going to run the wikihadoop software, I might as well only do it once.
I'm hesitant to do this via web scraping, because I think it'll take much longer than working with the dump files. However, if you have suggestions on how to get the diffs (or a similar format) efficiently from the dump files, I would definitely love to hear them.
I appreciate the help and advice!