I suspected as much. It seems "false positive" isn't a very good way to
think about the accuracy problem; rather, there appear to be three
states of interest. Assuming a pair of revision_ids representing the
information contained in a "revert": (reverted_id, reverting_id)
1. (reverted_id, reverting_id): The desired case. The reverted revision
appears to have had contributions discarded by the reverting revision.
2. (reverted_id, X): Suboptimal, but useful case. The reverted revision
was indeed reverted, but the associated reverting revision was not the one
that discarded the contributions.
3. (X, X): False positive. The reverted revision was not actually
reverted.
I had thought you were referring to case #3, while you were generally
referring to case #2. Is that right?
-Aaron
On Mon, Jul 2, 2012 at 1:23 PM, Floeck, Fabian (AIFB) <fabian.floeck(a)kit.edu>
wrote:
> First of all, thanks a lot for your questions and remarks. (btw Mako: nice
> panel talk yesterday at WPAC12)
>
> tl;dr: scroll all the way down for examples
>
> Questions by Mako:
> 1. Are you limiting this to edits that are separated from revisions with
> identical hashes by only one edit? --> I'm not quite sure what you mean,
> but we do not limit this to specific edits; the only exception is that
> both methods were tested with a limit of going back a maximum of 20
> revisions to look for reverts.
> 2. And are you sure your human coders aren't just relying on edit
> summaries? --> They could not see the edit summaries, due to our
> experimental setup.
> 3. HASH-A => HASH-B => HASH-A no revert? --> (assuming you mean HASH-B is
> only one revision/edit): this is ALWAYS a revert by A targeting B in both
> methods, and it was always evaluated as such by the users.
>
>
> Before I give examples, let me remind you that this is only a sample, so
> it is of course not statistically inferred that the 37% I mentioned
> necessarily appears in general in exactly the same way. Secondly, this
> number assumes that 80% of all participants agreed on an edit pair being
> a *full* revert; there were cases in the sample where people disagreed,
> and some cases where even a majority voted for a pair detected by MD5 to
> be a full revert, just not over 79%. I chose this threshold to make the
> differences clear; I could also have selected some other arbitrary value.
> That is exactly why we did not put it in the paper: the analysis in the
> paper is a much better ground for making statistical inferences about the
> data that is the "basic population" for this analysis.
>
> Now, let me give you some examples for false positives generated by the
> MD5 hash method:
>
> 1. One self-generated example (inspired by observations) is given in the
> paper (almost identically):
>
> RevID # RevContent (after edit) # Edit    # Hash
> 1     # Peanut                  # +Peanut # Hash1
> 2     # Peanut Apple            # +Apple  # Hash2
> 3     # Peanut Apple Banana     # +Banana # Hash3
> 4     # Peanut Banana           # -Apple  # Hash4
> 5     # Peanut                  # -Banana # Hash1
>
> MD5 assigns 5 as the reverting edit of 2, 3 and 4.
> DIFF assigns 5 as the reverting edit of 3, and 4 as the reverting edit of 2.
>
> False positive in this case (according to the Wikipedia definition) for
> MD5: 5 reverting 4 and 2 (4 is unrelated to what 5 does, and 2's
> contribution was already removed, so it cannot be undone again by 5).
>
>
> 2. "Real-life" examples rated as false positives in the user evaluation:
>
> When you asked me for the examples, I started digging them up from the
> data sample that was used and realized that many false positives of the
> MD5 method are related to self-reverts. As this is no issue for our data
> extraction aims (we want to have self-reverts in the results as well) and
> it was not considered when randomly drawing edit pairs from the two
> methods' results, we didn't discuss it in the paper. If you do not
> consider self-reverts to be reverts in the Wikipedia-definition sense,
> they could be filtered out by collapsing subsequent edits of one editor
> before running the revert analysis with the MD5 method. I assume that
> would notably reduce the number of false positives; I will certainly
> look into that.
> If you don't collapse these edits, however (which is *not* regularly done
> before reporting/using revert detection results), the number of false
> positives will be quite high, as the edits-to-be-collapsed (and prone to
> being misinterpreted) appear quite often, and their span can at times be
> considerably large. And of course there are cases not related to
> self-reverts.
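The collapsing step described above could be sketched as follows, under the assumption that revisions arrive as (rev_id, editor, content) tuples in page order (the data layout and names here are hypothetical):

```python
def collapse_consecutive(revisions):
    """Collapse runs of consecutive revisions by the same editor, keeping
    only the last revision of each run (its content is what the next
    editor actually saw)."""
    collapsed = []
    for rev in revisions:
        if collapsed and collapsed[-1][1] == rev[1]:
            collapsed[-1] = rev  # same editor as previous: keep latest state
        else:
            collapsed.append(rev)
    return collapsed

history = [
    (1, "A", "Peanut"),
    (2, "B", "Peanut poop"),
    (3, "B", "Peanut"),        # B self-reverts before anyone else edits
    (4, "C", "Peanut Apple"),
]
print(collapse_consecutive(history))
# [(1, 'A', 'Peanut'), (3, 'B', 'Peanut'), (4, 'C', 'Peanut Apple')]
```

After collapsing, B's self-revert disappears from the history, so an identity-based detector no longer pairs it with a later edit.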
>
> I tried to select examples representative of the sample which received
> very few or no votes as full reverts (though detected as such by MD5):
>
> Example A
> detected as reverted: http://en.wikipedia.org/w/index.php?&diff=25866415
> detected as reverting: http://en.wikipedia.org/w/index.php?diff=25866579
>
> The detected-as-reverting edit removes only "insomnia" from the
> detected-as-reverted edit, i.e. it is no full revert, as some insertions
> from previous edits had already been deleted by the reverted editor
> himself. It would be a correct full revert if you collapsed the reverted
> editor's edits into one.
>
>
> Example B
>
> detected as reverted: http://en.wikipedia.org/w/index.php?diff=196507540
> detected as reverting: http://en.wikipedia.org/w/index.php?diff=196507775
>
> Self-revert of the introduced vandalism ("kirsty u tit") before the
> second editor reverts --> the detected-as-reverted edit thus cannot be
> reverted again. This would also be remedied by collapsing the first
> editor's edits.
>
>
> Example C
> Not related to self-reverts; this is an example of incomplete vandalism
> repair, which is subsequently completed:
>
> detected as reverted: http://en.wikipedia.org/w/index.php?diff=162097520
> detected as reverting: http://en.wikipedia.org/w/index.php?diff=162113945
>
>
> Example D
> not related to self-revert
>
> A revert is carried out by TheJazzDalek targeting the edits by
> 74.131.204.39
> <http://en.wikipedia.org/wiki/Special:Contributions/74.131.…39>, but in
> the same edit something is deleted by TheJazzDalek, leading to a new,
> unique revision content. As 74.131.204.39 in the next edit reverts this
> deletion by TheJazzDalek, but not the initial revert of his own edits, it
> is erroneously concluded that 74.131.204.39 reverts himself, which is not
> the case.
>
> detected as reverted: http://en.wikipedia.org/w/index.php?diff=292533562
> detected as reverting: http://en.wikipedia.org/w/index.php?diff=292760323
>
>
>
> Example E
>
> detected as reverted: http://en.wikipedia.org/w/index.php?diff=231824943
> detected as reverting: http://en.wikipedia.org/w/index.php?diff=231960286
>
> First, the reverting editor (Laser brain) undoes (not rolling back to /
> not creating a duplicate revision) some edits by another editor before
> deleting the result of the edits by 67.162.68.255
> <http://en.wikipedia.org/wiki/Special:Contributions/67.162.68.255> (one
> of which was detected here as reverted). The "detected as reverted"
> revision is partly self-reverted by 67.162.68.255. The other part, a date
> change in an "accessdate=", is not "undone" as such, but the whole
> "accessdate=" part (stemming from a third editor) is deleted.
>
>
> Example F
>
> Here, between the "reverted" edit and the "reverting" one, there happens
> a mixture of self-reverts, reverts and different forms of vandalism:
>
> detected as reverted: http://en.wikipedia.org/w/index.php?diff=131372047
> detected as reverting: http://en.wikipedia.org/w/index.php?diff=131658207
>
>
> If I have failed to answer any of your questions, please excuse me and
> ask again.
>
>
> Best,
>
> Fabian
>
>
> On Jun 29, 2012, at 9:53 PM, Shilad Sen wrote:
>
> One example I've seen of MD5 failing is with common short vandalism
> phrases. The most obvious of these are removing all text, or replacing it
> with single vulgar words.
>
> Here's a scenario:
>
> - User A vandalizes an article by replacing it with "poop."
> - User B restores the article.
> - Some time passes....
> - User C vandalizes an article by replacing it with "poop."
>
> User C isn't really reverting B's edit. You may be able to guess that
> since A was reverted, C must not be reverting, but this logic can be tricky
> and wrong.
>
> In practice, I've been able to catch most of these instances by a)
> ignoring MD5 reverts outside a certain window of revisions and b)
> ignoring MD5 reverts that replace lots of text with very little text.
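Those two filters could be sketched roughly as below. The threshold values here are hypothetical placeholders, not the ones Shilad actually used:

```python
def plausible_md5_revert(reverted_idx, reverting_idx, texts,
                         max_window=20, min_size_ratio=0.2):
    """Filter an MD5-detected revert pair with two heuristics:
    (a) ignore reverts spanning more than `max_window` revisions, and
    (b) ignore reverts that replace lots of text with very little text.
    `texts` is the list of revision texts in page order."""
    # (a) window check: distant hash matches are likely coincidental
    if reverting_idx - reverted_idx > max_window:
        return False
    # (b) size check: compare the reverting revision with the revision it
    # replaces; blanking-style "reverts" shrink the page drastically
    before = len(texts[reverting_idx - 1])
    after = len(texts[reverting_idx])
    if before > 0 and after / before < min_size_ratio:
        return False
    return True

texts = ["A long article about peanuts.", "poop",
         "A long article about peanuts.", "poop"]
print(plausible_md5_revert(0, 2, texts))  # True: restores the article
print(plausible_md5_revert(1, 3, texts))  # False: replaces text with "poop"
```

The second call is exactly the "poop" scenario above: C's blanking matches A's earlier hash, but the size filter rejects it as a revert of B.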
>
> Also, this situation occurs regularly, but nowhere near 37%.
>
> -Shilad
>
> On Fri, Jun 29, 2012 at 11:15 AM, Benj. Mako Hill <mako(a)mit.edu>
> wrote:
>
>> <quote who="Aaron Halfaker" date="Wed, Jun 27, 2012 at 04:39:30PM -0700">
>> > I'm confused by your explanation.
>> >
>> > How is it possible that this 37% of revisions that are detected as
>> > reverts via a md5 hash are not considered reverts by (I presume)
>> > humans? Can you give a common example? By definition, identity revert
>> > revisions represent an exact replica of a previous revision in an
>> > article and, therefore, should discard any intermediate changes. What
>> > definition of "revert" are you using that the md5 hash method does not
>> > satisfy?
>>
>> Also, I can't tell from either the paper or the conversation here: Are
>> you limiting this to edits that are separated from revisions with
>> identical hashes by only one edit? When you do that, things become a
>> bit more complicated.
>>
>> And are you sure your human coders aren't just relying on edit
>> summaries? Like Aaron, I'm having a hard time imagining a situation
>> where revisions go HASH-A => HASH-B => HASH-A that shouldn't be
>> treated as a revert, and I tend to think this sounds more like fallible
>> coders than broken tools. If the user doesn't *know* or think they are
>> reverting an edit, it seems wrong *not* to call that a revert.
>>
>> Later,
>> Mako
>>
>>
>> --
>> Benjamin Mako Hill
>> mako(a)mit.edu
>> http://mako.cc/
>>
>> Creativity can be a social contribution, but only in so far
>> as society is free to use the results. --GNU Manifesto
>>
>> _______________________________________________
>> Wiki-research-l mailing list
>> Wiki-research-l(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>>
>
>
> --
> Shilad W. Sen
> Assistant Professor
> Mathematics, Statistics, and Computer Science Dept.
> Macalester College
> ssen(a)macalester.edu
> 651-696-6273
>
>
>
>
>
>
> --
>
> Karlsruhe Institute of Technology (KIT)
> Institute of Applied Informatics and Formal Description Methods
>
> Dipl.-Medwiss. Fabian Flöck
> Research Associate
>
> Building 11.40, Room 222
> KIT-Campus South
> D-76128 Karlsruhe
>
> Phone: +49 721 608 4 6584
> Skype: f.floeck_work
> E-Mail: fabian.floeck(a)kit.edu
> WWW: http://www.aifb.kit.edu/web/Fabian_Flöck
>
> KIT – University of the State of Baden-Wuerttemberg and
> National Research Center of the Helmholtz Association
>
>
>
>