Right now, I want all the edits to user pages and user talk pages,
2010-2013. But as I keep going with this project, I may want to expand a
bit, so I figured if I was going to run the wikihadoop software, I might as
well only do it once.
I'm hesitant to do this via web scraping, because I think it'll take much
longer than working with the dump files. However, if you have suggestions
on how to get the diffs (or a similar format) efficiently from the dump
files, I would definitely love to hear them.
I appreciate the help and advice!
On Mon, Oct 7, 2013 at 10:44 AM, Pierre-Carl Langlais <pierrecarl.langlais(a)gmail.com> wrote:
I agree with Klein. If you do not need to exploit the entire Wikipedia
database, requests through a Python scraping library (like Beautiful Soup)
are certainly sufficient and easy to set up. With a random algorithm to
select the page ids you can create a fine sample.
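For instance, a sample of user pages can be drawn directly from api.php (here with requests rather than Beautiful Soup; namespaces 2 and 3 are User and User talk). A rough, untested sketch:

import requests

API = "https://en.wikipedia.org/w/api.php"

def random_user_pages(n=50):
    """Draw a random sample of pages from the User and User talk namespaces."""
    r = requests.get(API, params={
        "action": "query",
        "list": "random",
        "rnnamespace": "2|3",   # 2 = User, 3 = User talk
        "rnlimit": n,
        "format": "json",
    })
    r.raise_for_status()
    return [p["title"] for p in r.json()["query"]["random"]]

print(random_user_pages(10))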
PCL
On 07/10/13 19:31, Klein, Max wrote:
Hi Susan,
Do you need the entire database diff'd? I.e. all edits ever. Or are you
interested in a particular subset of the diffs? It would help to know your
purpose.
For instance, I am interested in diffs around specific articles for
specific dates to study news events, so I calculate the diffs myself using
Python on page histories rather than on the entire database.
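In outline it is something like the snippet below (simplified from my actual scripts, and the article title is just a placeholder):

import difflib
import requests

API = "https://en.wikipedia.org/w/api.php"

def revision_diffs(title, start, end):
    """Fetch up to 50 revisions of one page between two timestamps
    and diff consecutive versions with difflib."""
    r = requests.get(API, params={
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|content",
        "rvdir": "newer",        # walk forward in time, so rvstart < rvend
        "rvstart": start,
        "rvend": end,
        "rvlimit": 50,           # content queries are capped at 50 per request
        "format": "json",
    })
    r.raise_for_status()
    page = next(iter(r.json()["query"]["pages"].values()))
    revs = page.get("revisions", [])
    for prev, cur in zip(revs, revs[1:]):
        diff = difflib.unified_diff(prev.get("*", "").splitlines(),
                                    cur.get("*", "").splitlines(), lineterm="")
        yield cur["timestamp"], list(diff)

# e.g. one article over one week (placeholder title)
for ts, d in revision_diffs("Some article", "2013-09-21T00:00:00Z", "2013-09-28T00:00:00Z"):
    print(ts, len(d), "diff lines")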
Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023
------------------------------
*From:* wiki-research-l-bounces(a)lists.wikimedia.org on behalf of Susan Biancani <inacnaib(a)gmail.com>
*Sent:* Thursday, October 03, 2013 10:06 PM
*To:* wiki-research-l(a)lists.wikimedia.org
*Subject:* [Wiki-research-l] diffdb formatted Wikipedia dump
I'm looking for a dump from English Wikipedia in diff format (i.e.
each entry is the text that was added/deleted since the previous edit, rather
than the full text of the page as of that revision).
The Summer of Research folks provided a handy guide on how to create such
a dataset from the standard complete dumps here:
http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
But the time estimate they give is prohibitive for me (20-24 hours for
each dump file--there are currently 158--running on 24 cores). I'm a grad
student in a social science department, and don't have access to extensive
computing power. I've been paying out of pocket for AWS, but this would get
expensive.
There is a diff-format dataset available, but only through April 2011
(here: http://dumps.wikimedia.org/other/diffdb/). I'd like to get a
diff-format dataset for January 2010 through March 2013 (or for everything
up to March 2013).
Does anyone know if such a dataset exists somewhere? Any leads or
suggestions would be much appreciated!
Susan
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l