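One example I've seen of md5 matching failing is with common short vandalism phrases. The most obvious cases are removing all text or replacing the page with a single vulgar word: the resulting revisions hash identically even when nobody is reverting to an earlier one.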
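Here is a rough sketch of that failure mode in Python. The editor names and revision texts are made up for illustration, and `md5_of` is just a helper I'm assuming, not anything from the paper's tooling.

```python
import hashlib


def md5_of(text: str) -> str:
    """Hex md5 digest of a revision's text (hypothetical helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical article history: two unrelated vandals blank the page.
history = [
    ("alice", "A long, well-sourced article."),
    ("vandal1", ""),  # first blanking
    ("bob", "A long, well-sourced article."),  # restores alice's text
    ("carol", "A long, well-sourced article, now with more detail."),
    ("vandal2", ""),  # second blanking, by a different user
]

hashes = [md5_of(text) for _, text in history]

# The two blankings hash identically, so a pure hash match would say that
# vandal2 "reverted to" vandal1's revision.
blank_matches = hashes[1] == hashes[4]
```

Whether the second blanking should count as a revert is exactly the definitional question; the sketch only shows that the hashes cannot tell the two readings apart.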
<quote who="Aaron Halfaker" date="Wed, Jun 27, 2012 at 04:39:30PM -0700">
> I'm confused by your explanation.
>
> How is it possible that this 37% of revisions that are detected as reverts
> via a md5 hash are not considered reverts by (I presume) humans? Can you
> give a common example? By definition, identity revert revisions represent
> an exact replica of a previous revision in an article and, therefore,
> should discard any intermediate changes. What definition of "revert" are
> you using that the md5 hash method does not satisfy?
Also, I can't tell from either the paper or the conversation here: Are
you limiting this to revisions that are separated from a revision with
an identical hash by only one edit? When you do that, things become a
bit more complicated.
And are you sure your human coders aren't just relying on edit
summaries? Like Aaron, I'm having a hard time imagining a situation
where revisions go HASH-A => HASH-B => HASH-A that shouldn't be
treated like a revert, and I tend to think this sounds more like
fallible human coders than broken tools. Even if the user doesn't
*know* or think they are reverting an edit, it seems wrong *not* to
call that a revert.
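
To make the HASH-A => HASH-B => HASH-A case concrete, here is a minimal sketch of an identity-revert detector in Python. The function name `find_identity_reverts` and the `max_distance` parameter are my own assumptions for illustration; I don't know how the paper's tool actually bounds the search.

```python
import hashlib
from typing import List, Optional, Tuple


def find_identity_reverts(
    texts: List[str], max_distance: Optional[int] = None
) -> List[Tuple[int, int]]:
    """Return (reverting_index, reverted_to_index) pairs.

    A revision counts as an identity revert when its md5 matches the most
    recent earlier revision with the same md5 and at least one other
    revision sits in between. max_distance (an assumed parameter) caps the
    number of intervening revisions; None means no cap.
    """
    last_seen = {}  # digest -> index of the latest revision with that digest
    reverts = []
    for i, text in enumerate(texts):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        j = last_seen.get(digest)
        if j is not None:
            intervening = i - j - 1
            # Skip back-to-back duplicates: nothing was undone.
            if intervening >= 1 and (
                max_distance is None or intervening <= max_distance
            ):
                reverts.append((i, j))
        last_seen[digest] = i
    return reverts


# HASH-A => HASH-B => HASH-A
simple = find_identity_reverts(["A", "B", "A"])
# Two intervening edits, with and without a one-edit limit.
limited = find_identity_reverts(["A", "B", "C", "A"], max_distance=1)
unlimited = find_identity_reverts(["A", "B", "C", "A"])
```

With `max_distance=1` the second history yields no reverts, while the uncapped call reports one, so the choice of window changes what gets counted.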
Later,
Mako
--
Benjamin Mako Hill
mako@mit.edu
http://mako.cc/
Creativity can be a social contribution, but only in so far
as society is free to use the results. --GNU Manifesto
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l