The deltas library implements a rough version of the WikiWho strategy behind a
difflib-style interface called "SegmentMatcher".
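For reference, here's a minimal sketch of how SegmentMatcher gets used in
practice (this is from memory, so treat the tokenizer and the operation
attribute names as approximate and check them against the deltas docs):

    from deltas import segment_matcher, text_split

    # Tokenize two revisions of the same page.
    a = text_split.tokenize("This is some text.  This is some other text.")
    b = text_split.tokenize("This is some other text.  This is some text.")

    # diff() yields equal/insert/delete operations over token ranges, so
    # moved text can be matched rather than counted as removed and re-added.
    for op in segment_matcher.diff(a, b):
        print(op.name,
              repr("".join(a[op.a1:op.a2])),
              repr("".join(b[op.b1:op.b2])))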
Re. diffs, I have some datasets that I have generated and can share. Would
enwiki-20150602 be recent enough for your uses?
If not, then I'd also like to point you to
which provides some nice utilities for processing diffs from
MediaWiki dumps in parallel using the `deltas` library. See
. Those utilities will
natively parallelize the computation so that you can divide the total runtime
(100 days) by the number of CPUs you have to run with, e.g. 100 days / 16 CPUs
≈ 6.3 days. On a Hadoop streaming setup (Altiscale), I've been able to
get the whole English Wikipedia history processed in 48 hours, so it's not
a massive benefit over that -- yet.
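If you end up rolling your own parallelism instead, the shape of it is simple
enough -- something like the sketch below, where process_dump() is a
hypothetical per-file worker (iterate revisions, diff consecutive ones, write
the results out):

    from multiprocessing import Pool

    def process_dump(path):
        # Hypothetical worker: stream one dump file, diff each revision
        # against its predecessor, and write the changes somewhere.
        ...

    if __name__ == "__main__":
        # One path per dump file; the full English Wikipedia history comes
        # split into many such files.
        dump_paths = ["enwiki-20150602-pages-meta-history1.xml.bz2"]
        with Pool(processes=16) as pool:  # ~100 days / 16 workers
            pool.map(process_dump, dump_paths)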
-Aaron
On Wed, Jan 20, 2016 at 8:49 AM, Flöck, Fabian <Fabian.Floeck(a)gesis.org>
wrote:
Hi, you can also look at our WikiWho code; we have
tested it, and it extracts
the changes between revisions considerably faster than a simple diff. See
here:
https://github.com/maribelacosta/wikiwho . You would have to adapt
the code a bit to give you the pure diffs, though. Let me know if you need
help.
Best,
Fabian
On 20.01.2016, at 13:15, Scott Hale <computermacgyver(a)gmail.com> wrote:
Hi Bowen,
You might compare the performance of Aaron Halfaker's deltas library:
https://github.com/halfak/deltas
(You might have already done so, I guess, but just in case)
In either case, I suspect the tasks will need to be parallelized to be
completed in a reasonable time frame. How many editions are you working with?
Cheers,
Scott
On Wed, Jan 20, 2016 at 10:44 AM, Bowen Yu <yuxxx856(a)umn.edu> wrote:
Hello all,
I am a 2nd-year PhD student working in the GroupLens Research group at the
University of Minnesota - Twin Cities. Recently, I have been working on a project
to study how identity-based and bond-based theories help explain editors'
behavior in WikiProjects within a group context, but I am facing a technical
problem that I need help and advice with.
I am trying to parse each editor's revision content from the XML
dumps - the content they added or deleted in each revision. I used the
compare function in difflib to obtain the added or deleted content by
comparing two string objects, which runs extremely slowly when the strings
are huge, as is specifically the case with Wikipedia revision content.
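(For concreteness, my current approach is roughly the following -- simplified,
with illustrative names:)

    import difflib

    def revision_changes(old_text, new_text):
        # Line-by-line comparison of two revisions with difflib;
        # this gets very slow on large pages.
        added, deleted = [], []
        diff = difflib.Differ().compare(old_text.splitlines(),
                                        new_text.splitlines())
        for line in diff:
            if line.startswith("+ "):
                added.append(line[2:])
            elif line.startswith("- "):
                deleted.append(line[2:])
        return added, deleted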
Without any parallel processing techniques, the expected runtime to
download and parse the 201 dumps would be ~100+ days. I was pointed to
Altiscale, but I'm not yet sure exactly how to use it for my problem.
It would be really great if anyone could give me some suggestions to help
me make more progress. Thanks in advance!
Sincerely,
Bowen
--
Dr Scott Hale
Data Scientist
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
scott.hale(a)oii.ox.ac.uk
Regards,
Fabian
--
Fabian Flöck
Research Associate
Computational Social Science department @GESIS
Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Tel: + 49 (0) 221-47694-208
fabian.floeck(a)gesis.org
www.gesis.org
www.facebook.com/gesis.org
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l