[Wiki-research-l] Parsing editor's each revision contents from wiki XML dumps

20 Jan 2016


Hello all,
I am a second-year PhD student in the GroupLens Research group at the
University of Minnesota - Twin Cities. I am currently working on a project
to study how identity-based and bond-based theories can help explain
editors' behavior in WikiProjects within the group context, but I have run
into a technical problem and would appreciate help and advice.
I am trying to extract the content each editor changed in each revision
from the XML dumps, that is, the text they added or deleted. I used the
compare function in difflib to obtain the added and deleted content by
comparing two string objects, but it runs extremely slowly when the strings
are large, as Wikipedia revision texts often are. Without any parallel
processing, the expected runtime to download and parse the 201 dumps would
be 100+ days. I was pointed to Altiscale, but I am not yet sure exactly
how to use it for my problem.
It would be great if anyone could offer suggestions to help me make
progress. Thanks in advance!
Sincerely,
Bowen
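
For reference, the diffing step described above (added and deleted text
between two revision strings) could be sketched roughly as follows. This is
only an illustration, not the code used in the project: it swaps the
line-oriented compare function for difflib.SequenceMatcher run over word
tokens, which may be cheaper on long texts because it skips the intraline
comparison. The names tokenize and revision_diff and the token regex are
assumptions made for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical tokenizer: words, whitespace runs, and single punctuation marks.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text):
    """Split revision text into tokens that rejoin to the original string."""
    return TOKEN_RE.findall(text)


def revision_diff(old_text, new_text):
    """Return (added, deleted) lists of text segments between two revisions."""
    old_tokens = tokenize(old_text)
    new_tokens = tokenize(new_text)
    # autojunk=False disables the popularity heuristic, which would otherwise
    # treat very frequent tokens (such as spaces) as junk in long sequences.
    matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    added, deleted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted.append("".join(old_tokens[i1:i2]))
        if tag in ("replace", "insert"):
            added.append("".join(new_tokens[j1:j2]))
    return added, deleted


if __name__ == "__main__":
    added, deleted = revision_diff("The cat sat.", "The black cat sat.")
    print("added:", added)
    print("deleted:", deleted)
```

Since each dump file can be processed independently, a function like this
could also be mapped over the files with multiprocessing.Pool to use
several cores, assuming memory per worker is not the limiting factor.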
