I've updated my dump processing Python project to include code for quickly detecting identity reverts from XML dumps. See https://bitbucket.org/halfak/wikimedia-utilities for the project and the process() function at the bottom of https://bitbucket.org/halfak/wikimedia-utilities/src/f1c8fe7224f3/wmf/dump/processors/reverts.py for the algorithm. The actual function with the revert detection logic is about 50 lines long.
An identity revert is one which changes the article to an absolutely identical previous state. This is a common operation in the English Wikipedia. There is a paper by Kittur & Kraut (and others), whose title I can't recall, that found that the vast majority of reverts of any sort were identity reverts. Some other types they define are:
- "Partial reverts": Part of an edit is discarded
- "Effective reverts": Looks to be an identity revert, but not *exactly* the same as a previous revision. Often a few white-space characters were out of place.
See http://www.grouplens.org/node/427 for a discussion of the difficulty of detecting reverts in better ways.
My code detects identity reverts. For example, suppose the following is the content of a sequence of revisions:
- Revision #1: "foo"
- Revision #2: "bar"
- Revision #3: "foobar"
- Revision #4: "bar"
- Revision #5: "barbar"
Revision #4 reverts back to revision #2, and revision #3 is reverted. When looking for identity reverts, I have found that limiting the number of revisions that can be reverted to ~15 produces the highest quality of results. This is discussed in http://www.grouplens.org/node/416 (see http://www-users.cs.umn.edu/~halfak/summaries/A_Jury_of_Your_Peers.html for a quick/dirty summary of the work).
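To make the identity-revert idea in this message concrete, here is a minimal Python sketch. It is not the process() function from wikimedia-utilities; the function name detect_identity_reverts, the radius parameter, and the choice of MD5 checksums are all assumptions made for this illustration. It compares each revision's checksum with the checksums of a limited window of recent revisions and reports a revert when an earlier, non-adjacent revision has identical content.

```python
import hashlib
from collections import deque


def detect_identity_reverts(revisions, radius=15):
    """Yield (reverting_id, reverted_to_id, reverted_ids) tuples.

    `revisions` is an iterable of (rev_id, text) pairs in chronological
    order. `radius` (a hypothetical parameter name) caps how many
    revisions a single revert may undo.
    """
    # Keep the reverted-to candidate plus at most `radius` later revisions.
    recent = deque(maxlen=radius + 1)
    for rev_id, text in revisions:
        checksum = hashlib.md5(text.encode("utf-8")).hexdigest()
        history = list(recent)
        # Search from the newest revision backwards for identical content.
        for pos in range(len(history) - 1, -1, -1):
            if history[pos][1] == checksum:
                reverted = [rid for rid, _ in history[pos + 1:]]
                # An identical immediate predecessor undoes nothing,
                # so it is not counted as a revert here.
                if reverted:
                    yield rev_id, history[pos][0], reverted
                break
        recent.append((rev_id, checksum))


if __name__ == "__main__":
    example = [(1, "foo"), (2, "bar"), (3, "foobar"), (4, "bar"), (5, "barbar")]
    for reverting, reverted_to, reverted in detect_identity_reverts(example):
        print(reverting, "reverts to", reverted_to, "undoing", reverted)
```

On the five-revision example from this message, the sketch reports that revision 4 reverts to revision 2 and that revision 3 is reverted. The reverting editor is simply whoever saved the reverting revision. Hashing keeps the comparison cheap, but it only catches exact matches, so "partial" and "effective" reverts are out of scope for this sketch.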
This subject deserves a long conversation, but I think the bit you might be interested in is that the identity revert (described above, with an example) seems to be the accepted approach for identifying reverts for most types of analyses.
-Aaron
On Fri, Aug 19, 2011 at 4:39 PM, Flöck, Fabian <fabian.floeck@kit.edu> wrote:
Hi Aaron,
Thanks, that would be awesome :) We built something ourselves, but I'm not quite content with it.
Could you also tell me how you defined a revert (and maybe how you determine who is the reverter)? Because this is a crucial issue for me.
Is it the complete deletion of all the characters entered by an editor in an edit? What about editors that revert others or delete content? Do you treat their edits as being reverted if the deleted content gets reintroduced? Did you take into account the location of the words in the text, or did you use a bag-of-words model?
I read many papers and tool documentations that use "reverts", and some mention their method (while many don't), but it seems almost no one describes their definition of what a "revert" actually is.
But maybe I will get the answers to this from your code as well :)
Anyway, thanks for the help!
Best,
Fabian
On 19 Aug 2011, at 18:31, Aaron Halfaker wrote:
Fabian,
I actually have some software for quickly producing reverts from a database dump. The framework for doing it is available here: https://bitbucket.org/halfak/wikimedia-utilities. I still have to package up the code that actually generates the reverts though. It's just a matter of finding time to sit down with it and figure out the dependencies! I expect that I can have it ready by Monday. I hope to actually package up the revert detecting code into the above Python project as an example.
I just wanted to let you know that I have a response for you on the way.
-Aaron
On Thu, Aug 18, 2011 at 4:40 AM, Flöck, Fabian <fabian.floeck@kit.edu> wrote:
Hi,
I'm trying to detect reverts in Wikipedia for my research, right now with a self-built script using MD5 hashes and diffs between revisions. I always read about people taking reverts into account in their data, but it's seldom described HOW exactly a revert is determined or what tool they use to do that. Can you point me to any research or tools, or maybe tell me what you used in your own research to identify which edits were reverted and/or who reverted them?
Best,
Fabian
--
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods
Dipl.-Medwiss. Fabian Flöck
Research Associate
Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe
Phone: +49 721 608 4 6584
Skype: f.floeck_work
E-Mail: fabian.floeck@kit.edu
WWW: http://www.aifb.kit.edu/web/Fabian_Flöck
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l