[Foundation-l] Frequency of Seeing Bad Versions - now with traffic data
Anthony
wikimail at inbox.org
Fri Aug 28 20:05:24 UTC 2009
On Fri, Aug 28, 2009 at 3:44 PM, Lars Aronsson <lars at aronsson.se> wrote:
> We can try to find out which edits are reverts, assuming that the
> previous edit was an act of vandalism.
But that's a bad assumption. It produces both false positives and false
negatives, and a significant number of each; I gave examples of both above.
My samples were tiny, but in them 38% of reverts were not reverts of
vandalism, and 40% of vandalism was not reverted by any means this strategy
would detect. And there is no reason to believe the error rate is consistent
over time, so these numbers are useless for determining whether or not the
problem is increasing.
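To make the two failure modes concrete, here is a rough sketch of the
heuristic (mine, in Python, not anyone's actual script), assuming revision
records that carry the sha1 text hash and timestamp found in MediaWiki XML
dumps:

    def label_by_identity_revert(revisions):
        """Call an edit a "revert" when it restores the exact text of an
        earlier revision, and label the edit just before it "vandalism".
        Each revision is assumed to be a dict with 'sha1' and 'timestamp'."""
        seen = set()
        labels = [None] * len(revisions)
        for i, rev in enumerate(revisions):
            if rev["sha1"] in seen:
                labels[i] = "revert"
                if i > 0:
                    # The heuristic's core assumption, and its weak point:
                    # whatever immediately preceded a revert was vandalism.
                    labels[i - 1] = "vandalism"
            seen.add(rev["sha1"])
        return labels

Both errors fall straight out of this: a revert of a good-faith edit still
gets labeled vandalism (a false positive), and vandalism cleaned up by
writing fresh text never matches an old hash (a false negative).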
> That way we can conclude
> which articles were vandalized and how long it took to revert
> them.
Your simplistic version, assuming that the previous edit was an act of
vandalism, makes the conclusion about "how long it took to revert" pretty
obviously flawed, doesn't it? Under that assumption (which is even worse
than the one Robert used), you're simply measuring the average time between
edits. Any act of vandalism that takes more than one edit to find and fix is
excluded.
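Under your assumption, the measurement reduces to this (my sketch again,
reusing the hypothetical records above):

    def naive_time_to_revert(revisions, labels):
        """The "time to revert" when the vandalism is assumed to be the
        edit immediately before the revert: just the gap between two
        consecutive edits."""
        gaps = []
        for i, label in enumerate(labels):
            if label == "revert" and i > 0:
                # Each measurement spans exactly one inter-edit interval,
                # so vandalism that survives several edits can never
                # appear in the sample.
                delta = revisions[i]["timestamp"] - revisions[i - 1]["timestamp"]
                gaps.append(delta.total_seconds())
        return gaps

With datetime timestamps, the result is an average of inter-edit gaps and
nothing more.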
Now, Robert's methodology wasn't quite that bad: it allowed for reverts
separated by one or more other edits. But it had no way to detect an act of
vandalism which lasted for hundreds of edits, was discovered by someone
reading the text, and was removed, without any reference to the original
edit, with an edit summary such as "Barack Obama was born in Hawaii". And
these acts of vandalism are the worst: they last the longest, they do the
most harm when they are read, they get the most views, etc. Any methodology
which excludes them is systematically biased.
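To put the gap in the same terms (my sketch, same hypothetical records):
identity matching, however many edits apart, still only sees edits that
restore old text.

    def is_identity_revert(revisions, i):
        """Robert-style detection as described above: revision i counts as
        a revert if its text exactly matches ANY earlier revision, even
        with other edits in between."""
        return revisions[i]["sha1"] in {r["sha1"] for r in revisions[:i]}

    # A manual correction ("Barack Obama was born in Hawaii") rewrites the
    # sentence instead of restoring an old revision, so its text hash is
    # brand new and is_identity_revert() returns False -- exactly the
    # long-lived vandalism argued above to matter most.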