Susan,

Hmm, it seems like that's a funny middle ground, where it's too long to fetch live from the site, although it's probably less than 158 days. I once read and edited 400,000 pages with pywikibot (3 network IO calls per page: read, external API, write) in about 20 days. You would only have to make two IO calls (read, getHistory) per user page. I don't know how many user pages there are, but that might give you enough numbers to solve the system of inequalities you need.
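
Roughly what I have in mind, as an untested sketch (this assumes current pywikibot core, installed and configured for enwiki; older compat versions spell these methods differently):

import pywikibot

site = pywikibot.Site('en', 'wikipedia')

# Walk the User talk: namespace (ns 3); User: pages are ns 2.
for page in site.allpages(namespace=3, filterredir=False):
    # One history call per page; pass content=True if you also want the
    # revision text, which you would need for computing diffs.
    for rev in page.revisions(content=False):
        if 2010 <= rev.timestamp.year <= 2013:
            print(page.title(), rev.revid, rev.user, rev.timestamp)

Timing that on a few hundred pages would tell you whether the whole namespace works out to days or months of wall-clock time.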

If you are dead set on using Hadoop, maybe you could use the Wikimedia Labs XGrid: https://wikitech.wikimedia.org/wiki/Main_Page.
They have some monster computing power, and it's free for bot operators and other tool runners. It might also be worth asking there whether someone already has wikihadoop set up.


Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023


From: wiki-research-l-bounces@lists.wikimedia.org <wiki-research-l-bounces@lists.wikimedia.org> on behalf of Susan Biancani <inacnaib@gmail.com>
Sent: Tuesday, October 08, 2013 3:28 PM
To: Research into Wikimedia content and communities
Subject: Re: [Wiki-research-l] diffdb formatted Wikipedia dump
 
Right now, I want all the edits to user pages and user talk pages, 2010-2013. But as I keep going with this project, I may want to expand a bit, so I figured if I was going to run the wikihadoop software, I might as well only do it once.

I'm hesitant to do this via web scraping, because I think it'll take much longer than working with the dump files. However, if you have suggestions on how to get the diffs (or a similar format) efficiently from the dump files, I would definitely love to hear them.

I appreciate the help and advice!


On Mon, Oct 7, 2013 at 10:44 AM, Pierre-Carl Langlais <pierrecarl.langlais@gmail.com> wrote:
I agree with Klein. If you do not need to process the entire Wikipedia database, requests through a Python scraping library (like Beautiful Soup) are certainly sufficient and easy to set up. With a random algorithm to select the page IDs you can build a fine sample.
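
For example, something like this untested sketch; it uses the API's random page list rather than scraping HTML with Beautiful Soup, but the sampling idea is the same:

import requests

API = 'https://en.wikipedia.org/w/api.php'

def random_user_pages(n=10):
    """Return up to n random User: (ns 2) and User talk: (ns 3) page titles."""
    r = requests.get(API, params={
        'action': 'query',
        'list': 'random',
        'rnnamespace': '2|3',
        'rnlimit': n,        # the API caps this for non-bot accounts
        'format': 'json',
    })
    return [p['title'] for p in r.json()['query']['random']]

print(random_user_pages())

Repeat until you have the sample size you need, then fetch and diff the histories of just those pages.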
PCL

On 07/10/13 19:31, Klein, Max wrote:
Hi Susan,

Do you need the entire database diff'd, i.e. all edits ever, or are you interested in a particular subset of the diffs? It would help to know your purpose.

For instance, I am interested in diffs of specific articles on specific dates to study news events, so I calculate the diffs myself using Python on page histories rather than on the entire database.
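
Not my actual script, but a minimal sketch of the idea: fetch a page history from the API, then diff consecutive revisions locally with difflib (continuation handling is omitted for brevity, so very long histories will be truncated):

import difflib
import requests

API = 'https://en.wikipedia.org/w/api.php'

def revision_diffs(title, start, end):
    # Oldest-first revisions of `title` between the ISO timestamps
    # `start` and `end`, with wikitext content included.
    r = requests.get(API, params={
        'action': 'query',
        'prop': 'revisions',
        'titles': title,
        'rvprop': 'timestamp|content',
        'rvdir': 'newer',
        'rvstart': start,
        'rvend': end,
        'rvlimit': 'max',
        'format': 'json',
    })
    page = next(iter(r.json()['query']['pages'].values()))
    revs = page.get('revisions', [])
    # Diff each revision against the one before it.
    for old, new in zip(revs, revs[1:]):
        yield new['timestamp'], '\n'.join(difflib.unified_diff(
            old['*'].splitlines(), new['*'].splitlines(), lineterm=''))

for ts, diff in revision_diffs('User talk:Jimbo Wales',
                               '2013-01-01T00:00:00Z', '2013-02-01T00:00:00Z'):
    print(ts)
    print(diff[:500])

The page title and date range above are just placeholders.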

Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023


From: wiki-research-l-bounces@lists.wikimedia.org <wiki-research-l-bounces@lists.wikimedia.org> on behalf of Susan Biancani <inacnaib@gmail.com>
Sent: Thursday, October 03, 2013 10:06 PM
To: wiki-research-l@lists.wikimedia.org
Subject: [Wiki-research-l] diffdb formatted Wikipedia dump
 
I'm looking for a dump from English Wikipedia in diff format (i.e. each entry is the text that was added or deleted since the previous edit, rather than the full current state of the page).

The Summer of Research folks provided a handy guide on how to create such a dataset from the standard complete dumps here: http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
But the time estimate they give is prohibitive for me (20-24 hours per dump file, running on 24 cores, and there are currently 158 dump files). I'm a grad student in a social science department and don't have access to extensive computing power. I've been paying out of pocket for AWS, but this would get expensive.

There is a diff-format dataset available, but only through April 2011 (here: http://dumps.wikimedia.org/other/diffdb/). I'd like a diff-format dataset covering January 2010 through March 2013 (or everything up to March 2013).

Does anyone know if such a dataset exists somewhere? Any leads or suggestions would be much appreciated!

Susan


_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

