2009/8/20 Gregory Maxwell gmaxwell@gmail.com:
Going back to your simple study now: the analysis of vandalism duration and its impact on readers rests on an assumption about readership which we know to be invalid. You're assuming a uniform distribution of readership: that readers are equally likely to read any given article. But we know that actual readership follows a power-law (long-tail) distribution. Because traffic levels aren't taken into account, we can't draw conclusions about how much vandalism readers are actually exposed to.
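To make the point concrete, here's a minimal sketch (all numbers invented for illustration) comparing the two assumptions: if article traffic follows a Zipf-like power law, the share of *articles* carrying vandalism and the share of *page views* hitting vandalism can diverge substantially.

```python
import random

random.seed(42)

# Hypothetical setup: 10,000 articles whose page views follow a
# Zipf-like (power-law) distribution; 1% of articles are vandalized
# at any given moment. All figures here are made up for the sketch.
N = 10_000
traffic = [1_000_000 / (rank ** 1.1) for rank in range(1, N + 1)]
vandalized = set(random.sample(range(N), N // 100))

# Uniform-readership assumption: every article is equally likely to
# be read, so the share of views hitting vandalism equals the share
# of articles that are vandalized.
uniform_share = len(vandalized) / N

# Traffic-weighted reality: weight each article by its page views.
vandal_views = sum(traffic[i] for i in vandalized)
weighted_share = vandal_views / sum(traffic)

print(f"uniform assumption: {uniform_share:.4f}")
print(f"traffic-weighted:   {weighted_share:.4f}")
```

Which direction the traffic-weighted figure moves depends entirely on whether vandalism lands disproportionately on high- or low-traffic pages, which is exactly the empirical question the uniform assumption papers over.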
We're also assuming a uniform distribution of vandalism, as it were. There are a number of different types of vandalism: obscene defacement, malicious alteration of factual content, meaningless test edits of a character or two, schoolkids leaving messages for each other...
...and it all has a different impact on the reader.
This has two implications:
a) It seems safe to assume that replacing the entire article with "john is gay" will be spotted and reverted faster, on average, than an edit providing a plausible-sounding but entirely fictional history for a small town in Kansas. So any change in the pattern of the *content* of vandalism is going to lead to changes in duration, and thus in the overall prevalence of visible vandalism, even if the number of vandal edits is constant.
b) We can easily compare the effect of vandalism being left on differently trafficked pages for various lengths of time - roughly speaking, time * traffic = number of readers affected. If some vandalism is worse than others, we could also calculate some kind of intensity metric - one hundred people viewing enormous genital piercing images on [[Kitten]] is probably worse than ten thousand people viewing "asdfdfggfh" at the end of a paragraph in the same article.
I'm not sure how we'd go ahead with the second one, but it's an interesting thing to think about.
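One way the intensity metric in (b) might look, as a rough sketch: readers affected is duration * traffic, and each incident gets a severity weight. The weights, the 200 views/hour figure, and the incident durations below are all invented purely to reproduce the [[Kitten]] comparison from the text.

```python
# Assumed severity weights, not derived from any study: 1.0 = worst.
SEVERITY = {
    "obscene_image": 1.0,        # full-page obscene defacement
    "factual_alteration": 0.7,   # plausible-sounding false content
    "gibberish": 0.005,          # "asdfdfggfh" at the end of a paragraph
}

def intensity(duration_hours, views_per_hour, kind):
    """Readers affected (time * traffic), weighted by severity."""
    readers_affected = duration_hours * views_per_hour
    return readers_affected * SEVERITY[kind]

# The two [[Kitten]] cases, assuming 200 views/hour:
# 0.5 h * 200 = 100 readers see the obscene image;
# 50 h * 200 = 10,000 readers see the gibberish.
obscene = intensity(0.5, 200, "obscene_image")   # 100 * 1.0   = 100.0
gibberish = intensity(50, 200, "gibberish")      # 10000 * 0.005 = 50.0
print(obscene, gibberish)
```

With these made-up weights the short-lived obscene image scores twice the gibberish despite reaching a hundredth as many readers, which matches the intuition in (b) - but the weights themselves are exactly the part we'd have no principled way to choose, which is presumably why it's hard to see how to go ahead with it.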