One place I've seen md5 matching fail is with common short vandalism
edits. The most obvious of these are blanking the page (removing all
text) and replacing it with a single vulgar word.
Here's a scenario:
- User A vandalizes an article by replacing it with "poop."
- User B restores the article.
- Some time passes....
- User C vandalizes the same article by replacing it with "poop."
User C isn't really reverting B's edit. You might be able to infer that,
since A was already reverted, C must not be reverting, but this logic can
be tricky and wrong.
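To make the scenario concrete, here is a minimal sketch of naive md5
identity-revert detection. The function names and the toy history are
made up for illustration; this is not the code from the paper or from
any particular tool.

```python
import hashlib


def md5_of(text):
    """Hex md5 digest of a revision's text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def naive_identity_reverts(texts):
    """Return (reverting_index, reverted_to_index) pairs.

    A revision is flagged whenever its hash matches an earlier,
    non-adjacent revision, no matter how long ago that was.
    """
    last_seen = {}  # hash -> index of the most recent revision with it
    reverts = []
    for i, text in enumerate(texts):
        h = md5_of(text)
        if h in last_seen and last_seen[h] < i - 1:
            reverts.append((i, last_seen[h]))
        last_seen[h] = i
    return reverts


# Hypothetical history mirroring the scenario above.
history = [
    "A long, well-sourced article.",                      # 0: original
    "poop",                                               # 1: User A vandalizes
    "A long, well-sourced article.",                      # 2: User B restores
    "A long, well-sourced article. Plus a new section.",  # 3: time passes
    "poop",                                               # 4: User C vandalizes
]

print(naive_identity_reverts(history))  # [(2, 0), (4, 1)]
```

The pair (2, 0) is B's genuine revert. The pair (4, 1) is User C's
vandalism, which gets counted as a "revert" to User A's revision.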
In practice, I've been able to catch most of these instances by a) ignoring
md5 reverts outside a certain window of revisions and b) ignoring md5
reverts that replace lots of text with very little text.
Also, this situation occurs regularly, but nowhere near 37% of the time.
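A rough sketch of filters (a) and (b), in case it helps. The parameter
names and the default values (window=15, max_shrink=0.1) are
placeholders I picked for illustration, not the thresholds I actually
used, and text length in characters stands in for "lots of text".

```python
import hashlib


def md5_of(text):
    """Hex md5 digest of a revision's text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def filtered_identity_reverts(texts, window=15, max_shrink=0.1):
    """Return (reverting_index, reverted_to_index) pairs after filtering.

    window     -- hypothetical cap on how many revisions back a match
                  may be and still count as a revert (filter a)
    max_shrink -- hypothetical ratio; a match is ignored if the new text
                  is shorter than max_shrink times the text it replaced
                  (filter b)
    """
    last_seen = {}  # hash -> index of the most recent revision with it
    reverts = []
    for i, text in enumerate(texts):
        h = md5_of(text)
        prev = last_seen.get(h)
        last_seen[h] = i
        if prev is None or prev == i - 1:
            continue  # no earlier match, or nothing in between to undo
        if i - prev > window:
            continue  # (a) the match is too far back
        replaced = texts[i - 1]
        if replaced and len(text) < max_shrink * len(replaced):
            continue  # (b) lots of text replaced with very little
        reverts.append((i, prev))
    return reverts


# Hypothetical history: A vandalizes, B restores, C vandalizes later.
history = [
    "A long, well-sourced article.",
    "poop",
    "A long, well-sourced article.",
    "A long, well-sourced article. Plus a new section.",
    "poop",
]

print(filtered_identity_reverts(history))  # [(2, 0)]
```

Only B's restoration survives here. Note that filter (b) would also drop
a legitimate revert back to a very short revision, so the ratio is a
trade-off rather than a clean fix.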
-Shilad
On Fri, Jun 29, 2012 at 11:15 AM, Benj. Mako Hill <mako(a)mit.edu> wrote:
<quote who="Aaron Halfaker" date="Wed, Jun 27, 2012 at 04:39:30PM -0700">
I'm confused by your explanation.
How is it possible that this 37% of revisions that are detected as
reverts via a md5 hash are not considered reverts by (I presume) humans?
Can you give a common example? By definition, identity revert revisions
represent an exact replica of a previous revision in an article and,
therefore, should discard any intermediate changes. What definition of
"revert" are you using that the md5 hash method does not satisfy?
</quote>
Also, I can't tell from either the paper or the conversation here: Are
you limiting this to revisions with identical hashes that are separated
by only one edit? When you do that, things become a bit more
complicated.
And are you sure your human coders aren't just relying on edit
summaries? Like Aaron, I'm having a hard time imagining a situation
where revisions go HASH-A => HASH-B => HASH-A that shouldn't be
treated like a revert, and I tend to think this sounds more like
fallible coders than broken tools. If the user doesn't *know* or think
they are reverting an edit, it seems wrong *not* to call that a revert.
Later,
Mako
--
Benjamin Mako Hill
mako(a)mit.edu
http://mako.cc/
Creativity can be a social contribution, but only in so far
as society is free to use the results. --GNU Manifesto
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
--
Shilad W. Sen
Assistant Professor
Mathematics, Statistics, and Computer Science Dept.
Macalester College
ssen(a)macalester.edu
651-696-6273