On Thu, Aug 20, 2009 at 6:06 AM, Robert Rohde <rarohde(a)gmail.com> wrote:
[snip]
When one downloads a dump file, what percentage of the pages are
actually in a vandalized state?
Although you don't actually answer that question, you answer a
different question:
[snip]
approximations: I considered that "vandalism" is that thing which
gets reverted, and that "reverts" are those edits tagged with "revert,
rv, undo, undid, etc." in the edit summary line. Obviously, not all
vandalism is cleanly reverted, and not all reverts are cleanly tagged.
Which is interesting too, but part of the problem with calling this a
measure of vandalism is that it isn't really, and we don't really have
a good handle on how solid an approximation it is beyond gut feelings
and arm-waving.
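
For concreteness, the revert heuristic quoted above could be sketched
roughly as follows. This is only an illustration, not the script that
was actually used: the keyword list covers just the tags named in the
quote (the "etc." means the real list was longer), and the names
REVERT_PATTERN, looks_like_revert and count_flagged_reverts are
hypothetical.

```python
import re

# Assumption: only the tags named in the quoted message are matched
# ("revert" and its inflections, "rv", "undo", "undid"). The original
# list ended in "etc.", so this pattern is knowingly incomplete.
REVERT_PATTERN = re.compile(r"\b(?:revert\w*|rv|undo|undid)\b", re.IGNORECASE)


def looks_like_revert(summary: str) -> bool:
    """Return True if an edit summary carries one of the revert tags."""
    return bool(REVERT_PATTERN.search(summary))


def count_flagged_reverts(summaries) -> int:
    """Count the edit summaries that the heuristic flags as reverts."""
    return sum(1 for s in summaries if looks_like_revert(s))
```

The word boundaries keep "rv" from matching inside ordinary words such
as "survey", but the sketch shares the limitation discussed above: a
revert with an untagged summary is missed, and vandalism that is never
reverted is not counted at all.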
We looked into this a couple of years ago and came up with a similar
number (I won't quote it because I don't quite remember what it was).
We estimated the probability that a viewer would encounter a damaged
article, rather than how many articles were currently damaged. We used
the term "damaged" instead of "vandalized" for essentially the reasons
you mention (though I confess I didn't fully read your whole letter).
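
To illustrate why the two quantities can differ, here is a minimal
sketch, assuming each article is reduced to a (views, is_damaged) pair
at a single moment. The function names and this data layout are
hypothetical, and this is not necessarily how either estimate was
computed.

```python
def fraction_damaged(articles):
    """Share of articles currently damaged, each article counted once."""
    return sum(1 for _, damaged in articles if damaged) / len(articles)


def probability_view_is_damaged(articles):
    """Share of page views that land on a damaged article."""
    total_views = sum(views for views, _ in articles)
    damaged_views = sum(views for views, damaged in articles if damaged)
    return damaged_views / total_views
```

With made-up inputs such as [(900, False), (90, False), (10, True)],
one article in three is damaged, yet only 10 of 1000 views hit the
damaged one, so the per-article and per-view figures need not agree.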
Priedhorsky et al., GROUP 2007.
Reid