On Thu, Aug 20, 2009 at 9:30 PM, Mark Wagner <carnildo@gmail.com> wrote:
On Thu, Aug 20, 2009 at 14:10, Anthony <wikimail@inbox.org> wrote:
"if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision?" The best way to answer that question would be with a manually processed random sample taken from a pre-chosen moment in time. As few as 1000 revisions would probably be sufficient, if I know anything about statistics, but I'll let someone with more knowledge of statistics verify or refute that. The results will depend heavily on one's definition of "vandalism", though.
I did this in an informal fashion in 2005 during my "hundred article" surveys. Of the 503 pages I looked at, only one was clearly vandalized the first time I looked at it, so I'd say a thousand samples is probably too small to get any sort of precision on the vandalism rate.
Why? My understanding is that, if your methodology was correct, you can say with 96% confidence that the percentage of vandalized articles is less than 0.6%. That's useful. With 1000 samples, if you found two instances of vandalism, you'd have 97% confidence that the percentage of vandalized articles is less than 0.5%.
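For anyone who wants to check figures like these, here is a minimal sketch of two ways to turn "k vandalized pages out of n sampled" into a confidence that the true rate is below some bound. The function names are made up for illustration. It is an assumption, not something stated in this thread, that the 96% and 97% figures came from a two-sided normal approximation; that method gives values close to them, while the exact binomial tail gives noticeably lower confidence (roughly 80% and 88%), because the normal approximation is poor when only one or two events are observed.

```python
import math

def normal_approx_confidence(k, n, bound):
    """Two-sided normal-approximation confidence that the true rate lies
    within (bound - k/n) of the observed rate k/n. Unreliable for tiny k."""
    if k == 0:
        raise ValueError("standard error is zero when k == 0")
    p_hat = k / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
    z = (bound - p_hat) / se                  # distance to the bound in SEs
    return math.erf(z / math.sqrt(2))         # P(|Z| < z) for standard normal Z

def exact_confidence(k, n, bound):
    """One-sided exact (binomial) confidence that the true rate is below
    `bound`: 1 - P(X <= k) when the true rate equals `bound`."""
    tail = sum(
        math.comb(n, i) * bound**i * (1 - bound) ** (n - i)
        for i in range(k + 1)
    )
    return 1 - tail

# The two cases discussed above: 1 of 503 vs. 0.6%, and 2 of 1000 vs. 0.5%.
for k, n, bound in [(1, 503, 0.006), (2, 1000, 0.005)]:
    print(k, n, bound,
          round(normal_approx_confidence(k, n, bound), 3),
          round(exact_confidence(k, n, bound), 3))
```

Either way, the sample does put a usable upper limit on the rate; the two methods only disagree about how confident one can be in a particular limit.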
I don't think it's that low, but if you publish the details of your "hundred article" surveys, I might be persuaded that it is.
If we really do have that figure to that level of assurance, a more useful statistic would be the percentage of pageviews that result in a vandalized article. That could be arrived at by weighting by pageviews while choosing your random sample.
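A rough sketch of what weighting by pageviews could look like: draw page titles with probability proportional to their view counts, then have a person classify each drawn page. The `pageviews` mapping and the `is_vandalized` callback are hypothetical stand-ins; nothing here reflects an actual Wikimedia data source or API.

```python
import random

def sample_by_pageviews(pageviews, k, seed=None):
    """Draw k titles (with replacement), each with probability
    proportional to its view count in the `pageviews` dict."""
    rng = random.Random(seed)
    titles = list(pageviews)
    weights = [pageviews[t] for t in titles]
    return rng.choices(titles, weights=weights, k=k)

def vandalized_view_rate(sample, is_vandalized):
    """Fraction of sampled titles judged vandalized. With a
    pageview-weighted sample this estimates the share of pageviews
    that land on a vandalized page."""
    return sum(1 for t in sample if is_vandalized(t)) / len(sample)

# Hypothetical view counts, for illustration only.
example_views = {"Popular page": 9000, "Obscure page": 10, "Stub": 1}
sample = sample_by_pageviews(example_views, k=1000, seed=1)
```

Sampling with replacement keeps the estimate simple: a heavily viewed page can be drawn many times, which is exactly the weighting wanted.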
One flaw I found in my proposed methodology is that the "moment in time" needs to be randomized, since certain times of the day/week/year might very well experience higher vandalism than others.
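Randomizing the moment could be as simple as drawing a uniform timestamp from the study period for each sampled page and then inspecting whatever revision was current at that instant. A minimal sketch, with a hypothetical function name and an arbitrary example period:

```python
import random
from datetime import datetime, timedelta, timezone

def random_moment(start, end, rng=None):
    """Return a timestamp drawn uniformly between start and end, so no
    particular time of day, week, or year is over-represented."""
    rng = rng or random.Random()
    span_seconds = (end - start).total_seconds()
    return start + timedelta(seconds=rng.uniform(0, span_seconds))

# Example period chosen only for illustration.
period_start = datetime(2009, 1, 1, tzinfo=timezone.utc)
period_end = datetime(2010, 1, 1, tzinfo=timezone.utc)
moment = random_moment(period_start, period_end, random.Random(42))
```

Covering a full year, as in the example, averages over daily, weekly, and seasonal patterns alike.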