New subject: Revert detection

22 Aug 2011

I've updated my dump processing Python project to include code for quickly
detecting identity reverts from XML dumps.  See
https://bitbucket.org/halfak/wikimedia-utilities for the project and the
process() function at the bottom of
https://bitbucket.org/halfak/wikimedia-utilities/src/f1c8fe7224f3/wmf/dump/p...
for the algorithm.  The actual function with the revert detection logic is
about 50 lines long.
The resulting dump.map function, using this revert processor, will emit
"revert" rows and "reverted" rows with the following fields:

Revert revision:
   - "revert" - denotes that this row is a reverting edit
   - revision_id - the rev_id of the reverting edit
   - reverted_to_id - the rev_id of the reverted-to edit
   - for_vandalism - determined by applying the D_LOOSE/D_STRICT regular
     expressions to the reverting comment (see Priedhorsky et al.,
     "Creating, Destroying and Restoring Value in Wikipedia", GROUP 2007)
   - reverted_revs - number of revisions that were reverted (this is the
     number of revisions between the reverting edit and the reverted-to
     edit)

Reverted revision:
   - "reverted" - denotes that this row is a reverted edit
   - revision_id - the rev_id of the reverted edit
   - reverting_id - the rev_id of the reverting edit
   - reverted_to_id - the rev_id of the reverted-to edit
   - for_vandalism - determined by applying the D_LOOSE/D_STRICT regular
     expressions to the reverting comment (see Priedhorsky et al.,
     "Creating, Destroying and Restoring Value in Wikipedia", GROUP 2007)
   - reverted_revs - number of revisions that were reverted (this is the
     number of revisions between the reverting edit and the reverted-to
     edit)
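
As a rough illustration of the two row shapes listed above, here is a minimal
Python sketch.  The names RevertRow, RevertedRow and rows_for_revert are
hypothetical and not taken from wikimedia-utilities; the project's real output
format may differ, and for_vandalism is assumed here to be a boolean.

```python
from collections import namedtuple

# Hypothetical record types mirroring the field lists above.
RevertRow = namedtuple(
    "RevertRow",
    ["kind", "revision_id", "reverted_to_id", "for_vandalism", "reverted_revs"],
)
RevertedRow = namedtuple(
    "RevertedRow",
    ["kind", "revision_id", "reverting_id", "reverted_to_id",
     "for_vandalism", "reverted_revs"],
)


def rows_for_revert(reverting_id, reverted_to_id, reverted_ids, for_vandalism):
    """Yield one "revert" row, then one "reverted" row per reverted revision."""
    count = len(reverted_ids)
    yield RevertRow("revert", reverting_id, reverted_to_id, for_vandalism, count)
    for rev_id in reverted_ids:
        yield RevertedRow("reverted", rev_id, reverting_id, reverted_to_id,
                          for_vandalism, count)
```

Under these assumptions, a revert of one revision produces two rows: one
describing the reverting edit and one describing the edit it undid.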
I hope this is helpful.
-Aaron
On Fri, Aug 19, 2011 at 3:08 PM, Aaron Halfaker aaron.halfaker@gmail.com wrote:
...
An identity revert is one which changes the article to an absolutely
identical previous state.  This is a common operation in the English
Wikipedia.
There is a Kittur & Kraut (and others) paper, whose title I can't recall,
that found the vast majority of reverts of any sort were identity reverts.
Some other types they define are:

"Partial reverts": Part of an edit is discarded.
"Effective reverts": Looks to be an identity revert, but is not *exactly*
   the same as a previous revision.  Often a few white-space characters
   were out of place.
See http://www.grouplens.org/node/427 for a discussion of the difficulty
of detecting reverts in better ways.
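
To illustrate only the white-space case mentioned for "effective reverts",
one naive option is to collapse white-space before comparing checksums.  This
is a sketch under my own assumptions, not the method used in the papers or in
wikimedia-utilities; the function name normalized_checksum and the use of
SHA-256 are hypothetical choices, and it does nothing for partial reverts.

```python
import hashlib
import re


def normalized_checksum(text):
    """Checksum of the text with every run of white-space collapsed to one space."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()
```

Two revisions that differ only in spacing or trailing newlines then share a
checksum, while any other difference still yields distinct checksums.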
My code detects identity reverts.  For example, suppose the following is the
content of a sequence of revisions:

"foo"
"bar"
"foobar"
"bar"
"barbar"

Revision #4 reverts back to revision #2 and revision #3 is reverted.  When
looking for identity reverts, I have found that limiting the number of
revisions that can be reverted to ~15 produces the highest quality of
results.  This is discussed in http://www.grouplens.org/node/416 (see
http://www-users.cs.umn.edu/~halfak/summaries/A_Jury_of_Your_Peers.html for a
quick/dirty summary of the work).
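
The example above can be sketched in a few lines of Python.  This is an
illustration of the idea, not the process() function from
wikimedia-utilities: the function name detect_identity_reverts, the
(rev_id, text) input format and the choice of SHA-256 checksums are all
assumptions.  The radius parameter plays the role of the ~15 revision limit.

```python
import hashlib
from collections import deque


def detect_identity_reverts(revisions, radius=15):
    """Yield (reverting_id, reverted_to_id, reverted_ids) per identity revert.

    `revisions` is an iterable of (rev_id, text) pairs in chronological order.
    At most `radius` revisions may sit between a revert and its target.
    """
    # Keep the reverted-to candidate plus up to `radius` revisions after it.
    recent = deque(maxlen=radius + 1)
    for rev_id, text in revisions:
        checksum = hashlib.sha256(text.encode("utf-8")).hexdigest()
        history = list(recent)
        # Walk backwards to find the most recent identical revision.
        for i in range(len(history) - 1, -1, -1):
            if history[i][1] == checksum:
                reverted_ids = [rid for rid, _ in history[i + 1:]]
                if reverted_ids:  # an identical neighbour reverts nothing
                    yield rev_id, history[i][0], reverted_ids
                break
        recent.append((rev_id, checksum))
```

Fed the five revisions from the example (numbered 1 to 5), the sketch reports
a single revert: revision 4 reverts to revision 2, and revision 3 is reverted.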
This subject deserves a long conversation, but I think the bit you might be
interested in is that the identity revert (described above, with an example)
seems to be the accepted approach for identifying reverts for most types of
analyses.
-Aaron
On Fri, Aug 19, 2011 at 4:39 PM, Flöck, Fabian fabian.floeck@kit.edu wrote:
...
Hi Aaron,
thanks, that would be awesome :) we built something ourselves, but I'm not
quite content with it.
Could you also tell me how you defined a revert (and maybe how you
determine who is the reverter)? Because this is a crucial issue for me.
Is it the complete deletion of all the characters entered by an editor in
an edit? What about editors that revert others or delete content? Do you
treat their edits as being reverted if the deleted content gets
reintroduced? Did you take into account location of the words in the text or
did you use a bag-of-words model?
I read many papers and tool documentations that use "reverts", and some
mention their method (while many don't), while it seems almost no-one
describes their definition of what a "revert" actually is.
But maybe I will get the answers to this from your code as well :)
Anyway, thanks for the help!
Best,
Fabian
On 19 Aug 2011, at 18:31, Aaron Halfaker wrote:
Fabian,
I actually have some software for quickly producing reverts from a
database dump.  The framework for doing it is available here:
https://bitbucket.org/halfak/wikimedia-utilities.  I still have to
package up the code that actually generates the reverts though.  It's just a
matter of finding time to sit down with it and figure out the dependencies!
 I expect that I can have it ready by Monday.  I hope to actually package up
the revert detecting code into the above python project as an example.
I just wanted to let you know that I have a response for you on the way.
-Aaron
On Thu, Aug 18, 2011 at 4:40 AM, Flöck, Fabian fabian.floeck@kit.edu wrote:
...
Hi,
I'm trying to detect reverts in Wikipedia for my research, right now with
a self-built script using MD5 hashes and diffs between revisions. I always
read about people taking reverts into account in their data, but it's
seldom described HOW exactly a revert is determined or what tool they use
to do that. Can you point me to any research or tools or tell me maybe what
you used in your own research to identify which edits were reverted and/or
who reverted them?
Best,
Fabian
--
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods
Dipl.-Medwiss. Fabian Flöck
Research Associate
Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe
Phone: +49 721 608 4 6584
Skype: f.floeck_work
E-Mail: fabian.floeck@kit.edu
WWW: http://www.aifb.kit.edu/web/Fabian_Fl%C3%B6ck
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association

Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: [Wiki-research-l] Revert detection