One case where I've seen md5 matching fail is short, common vandalism. The most obvious examples are blanking the page (removing all text) or replacing it with a single vulgar word.

Here's a scenario:
User C isn't really reverting B's edit. You may be able to guess that since A was reverted, C must not be reverting, but this logic can be tricky and wrong.

In practice, I've been able to catch most of these instances by a) ignoring md5 reverts outside a certain window of revisions and b) ignoring md5 reverts that replace lots of text with very little text.
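
For concreteness, here is a rough Python sketch of those two filters. It is only an illustration, not the code I actually run: the Revision type, the function name, and the threshold values (window, min_len, shrink_ratio) are all made up for the example and would need tuning against real data.

```python
import hashlib
from collections import deque
from typing import List, NamedTuple, Tuple


class Revision(NamedTuple):
    """Hypothetical minimal revision record: an id and the full page text."""
    rev_id: int
    text: str


def find_md5_reverts(revisions: List[Revision],
                     window: int = 15,
                     min_len: int = 50,
                     shrink_ratio: float = 0.1) -> List[Tuple[int, int]]:
    """Return (reverting_rev_id, restored_rev_id) pairs.

    Filter (a): only the last `window` revisions are searched for a match.
    Filter (b): a match is ignored when the edit replaces a lot of text
    (at least `min_len` characters) with very little (at most
    `shrink_ratio` of the previous length), e.g. page blanking.
    All three thresholds are illustrative assumptions.
    """
    recent = deque(maxlen=window)  # (md5 digest, rev_id) of recent revisions
    reverts = []
    prev_len = None
    for rev in revisions:
        digest = hashlib.md5(rev.text.encode("utf-8")).hexdigest()
        # Skip the immediately preceding revision: an identical hash there
        # is a null edit, not a revert. Search newest to oldest.
        candidates = list(recent)[:-1]
        match = next((rid for d, rid in reversed(candidates) if d == digest),
                     None)
        if match is not None:
            collapsed = (prev_len is not None
                         and prev_len >= min_len
                         and len(rev.text) <= shrink_ratio * prev_len)
            if not collapsed:
                reverts.append((rev.rev_id, match))
        recent.append((digest, rev.rev_id))
        prev_len = len(rev.text)
    return reverts
```

With these made-up defaults, a plain A => B => A sequence is still reported as a revert, while a blanking that happens to match an earlier blank revision is dropped.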

Also, this situation occurs regularly, but nowhere near 37% of the time.

-Shilad

On Fri, Jun 29, 2012 at 11:15 AM, Benj. Mako Hill <mako@mit.edu> wrote:
<quote who="Aaron Halfaker" date="Wed, Jun 27, 2012 at 04:39:30PM -0700">
> I'm confused by your explanation.
>
> How is it possible that this 37% of revisions that are detected as reverts
> via a md5 hash are not considered reverts by (I presume) humans?  Can you
> give a common example?  By definition, identity revert revisions represent
> an exact replica of a previous revision in an article and, therefore,
> should discard any intermediate changes.  What definition of "revert" are
> you using that the md5 hash method does not satisfy?

Also, I can't tell from either the paper or the conversation here: Are
you limiting this to revisions with identical hashes that are
separated by only one edit? When you do that, things become a
bit more complicated.

And are you sure your human coders aren't just relying on edit
summaries? Like Aaron, I'm having a hard time imagining a situation
where revisions go HASH-A => HASH-B => HASH-A that shouldn't be
treated like a revert, and I tend to think this sounds more like
fallible coders than broken tools. If the user doesn't *know* or think
they are reverting an edit, it seems wrong *not* to call that a revert.

Later,
Mako


--
Benjamin Mako Hill
mako@mit.edu
http://mako.cc/

Creativity can be a social contribution, but only in so far
as society is free to use the results. --GNU Manifesto

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l




--
Shilad W. Sen
Assistant Professor
Mathematics, Statistics, and Computer Science Dept.
Macalester College
ssen@macalester.edu
651-696-6273