Hi, you can also look at our WikiWho code. In our tests it extracted the changes
between revisions considerably faster than a simple diff. See here:
https://github.com/maribelacosta/wikiwho . You would have to adapt the code a bit to get
the pure diffs, though. Let me know if you need help.
Best,
Fabian
On 20.01.2016, at 13:15, Scott Hale <computermacgyver@gmail.com> wrote:
Hi Bowen,
You might compare the performance of Aaron Halfaker's deltas library:
https://github.com/halfak/deltas
(You might have already done so, I guess, but just in case)
In either case, I suspect the tasks will need to be parallelized to finish in a
reasonable amount of time. How many editions are you working with?
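
A minimal sketch of what such parallelization could look like, assuming each page's revision history can be diffed independently of every other page. The names diff_page and diff_pages_in_parallel and the (page_id, revisions) input shape are hypothetical, and the whitespace tokenization is only a placeholder; it uses the standard library rather than deltas or WikiWho.

```python
import difflib
from multiprocessing import Pool


def diff_page(page):
    """Count tokens added and removed between consecutive revisions of one page.

    `page` is a hypothetical (page_id, [revision_text, ...]) tuple.
    """
    page_id, revisions = page
    counts = []
    previous = []  # the first revision is compared against an empty page
    for text in revisions:
        current = text.split()  # placeholder whitespace tokenization
        matcher = difflib.SequenceMatcher(None, previous, current, autojunk=False)
        added = removed = 0
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                removed += i2 - i1  # tokens dropped from the old revision
                added += j2 - j1    # tokens introduced by the new revision
        counts.append((added, removed))
        previous = current
    return page_id, counts


def diff_pages_in_parallel(pages, workers=4):
    """Spread pages over worker processes; result order is not preserved."""
    with Pool(processes=workers) as pool:
        return dict(pool.imap_unordered(diff_page, pages))
```

On platforms that start worker processes with spawn, the call to diff_pages_in_parallel should sit under an `if __name__ == "__main__":` guard. The same pattern would apply with a whole dump file as the unit of work instead of a page.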
Cheers,
Scott
On Wed, Jan 20, 2016 at 10:44 AM, Bowen Yu <yuxxx856@umn.edu> wrote:
Hello all,
I am a 2nd-year PhD student working in the GroupLens Research group at the University of
Minnesota - Twin Cities. I am currently working on a project to study how identity-based
and bond-based theories can help explain editors' behavior in WikiProjects within the
group context, but I have run into a technical problem and need help and advice.
I am trying to parse the content of each revision from the XML dumps, that is, the
content each editor added or deleted in each revision. I used the compare function in
difflib to obtain the added or deleted content by comparing two string objects, which runs
extremely slowly when the strings are huge, as is the case for Wikipedia revision
contents. Without any parallel processing, the expected runtime to download and parse the
201 dumps would be ~100+ days. I was pointed to Altiscale, but I am not yet sure exactly
how to use it for my problem.
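
For reference, a minimal sketch of one way to pull added and deleted content out of two revisions with difflib, assuming token-level changes are enough. It compares token lists with SequenceMatcher.get_opcodes() instead of calling compare on the strings, which skips the character-level matching that difflib's Differ performs inside similar lines. The names tokenize and added_and_removed are hypothetical, and the regular expression is only a stand-in for a proper wikitext tokenizer.

```python
import difflib
import re


def tokenize(text):
    # Placeholder tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def added_and_removed(old_text, new_text):
    """Return (added, removed): lists of token runs that differ between revisions."""
    old_tokens = tokenize(old_text)
    new_tokens = tokenize(new_text)
    # autojunk=False turns off the popularity heuristic that would otherwise
    # ignore very frequent tokens in long sequences.
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.append(" ".join(old_tokens[i1:i2]))
        if tag in ("replace", "insert"):
            added.append(" ".join(new_tokens[j1:j2]))
    return added, removed


added, removed = added_and_removed(
    "The cat sat on the mat.",
    "The black cat sat on a mat.",
)
print(added)    # ['black', 'a']
print(removed)  # ['the']
```

Whether this is fast enough on full revision histories would need to be measured against the libraries suggested in the replies.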
It would be really great if anyone could give me some suggestions to help me make more
progress. Thanks in advance!
Sincerely,
Bowen
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
--
Dr Scott Hale
Data Scientist
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
scott.hale@oii.ox.ac.uk
Regards,
Fabian
--
Fabian Flöck
Research Associate
Computational Social Science department @GESIS
Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Tel: + 49 (0) 221-47694-208
fabian.floeck@gesis.org
www.gesis.org
www.facebook.com/gesis.org