[Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

Anthony wikimail at inbox.org
Fri Aug 21 01:47:18 UTC 2009


On Thu, Aug 20, 2009 at 9:30 PM, Mark Wagner <carnildo at gmail.com> wrote:

> On Thu, Aug 20, 2009 at 14:10, Anthony<wikimail at inbox.org> wrote:
> > "if one chooses a random page from Wikipedia right now, what is the
> > probability of receiving a vandalized revision"  The best way to answer
> > that question would be with a manually processed random sample taken
> > from a pre-chosen moment in time.  As few as 1000 revisions would
> > probably be sufficient, if I know anything about statistics, but I'll
> > let someone with more knowledge of statistics verify or refute that.
> > The results will depend heavily on one's definition of "vandalism",
> > though.
>
> I did this in an informal fashion in 2005 during my "hundred article"
> surveys.  Of the 503 pages I looked at, only one was clearly
> vandalized the first time I looked at it, so I'd say a thousand
> samples is probably too small to get any sort of precision on the
> vandalism rate.


Why?  My understanding is that, if your methodology was correct, you can say
with 96% confidence that the percentage of vandalized articles is less than
0.6%.  That's useful.  With 1000 samples, if you found two instances of
vandalism, you'd have 97% confidence that the percentage of vandalized
articles is less than 0.5%.
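
As a rough sketch of where figures like these can come from: a two-sided
normal approximation to the binomial gives values close to the 96% and 97%
above.  That this is the calculation behind those numbers is an assumption,
and the function names below are only illustrative.  An exact one-sided
binomial calculation is included for comparison, since the normal
approximation is known to be optimistic when the observed count is this
small.

```python
from math import comb, erf, sqrt

def wald_confidence(k, n, p_max):
    """Two-sided normal-approximation confidence that the true rate lies
    within (p_max - p_hat) of the observed rate p_hat = k / n.
    Requires 0 < k < n so that the standard error is non-zero."""
    p_hat = k / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    z = (p_max - p_hat) / se
    return erf(z / sqrt(2))

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def exact_confidence(k, n, p_max):
    """One-sided exact confidence that the true rate is below p_max:
    1 - P(seeing k or fewer hits if the rate were exactly p_max)."""
    return 1.0 - binom_cdf(k, n, p_max)

if __name__ == "__main__":
    for k, n, p_max in [(1, 503, 0.006), (2, 1000, 0.005)]:
        print(k, n, p_max,
              round(wald_confidence(k, n, p_max), 3),
              round(exact_confidence(k, n, p_max), 3))
```

The exact calculation comes out noticeably lower than the normal
approximation for both cases, so the confidence level one reports depends
on which method is used.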

I don't think it's that low, but if you publish the details of your "hundred
article" surveys, I might be persuaded that it is.

If we really do have that figure to that level of assurance, a more useful
statistic would be the percentage of pageviews that result in a vandalized
article.  That could be arrived at by weighting by pageviews while choosing
your random sample.
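
A minimal sketch of that weighting, assuming a hypothetical mapping from
article title to pageview count (the data source and the function name are
made up for illustration).  Articles are drawn with replacement, with
probability proportional to their pageviews, so the share of vandalized
draws estimates the share of pageviews that land on a vandalized article.

```python
import random

def sample_by_pageviews(pageviews, k, seed=None):
    """Draw k article titles with replacement, each with probability
    proportional to its pageview count.

    pageviews: dict mapping title -> non-negative pageview count
               (hypothetical input; at least one count must be positive).
    """
    rng = random.Random(seed)
    titles = list(pageviews)
    weights = [pageviews[t] for t in titles]
    return rng.choices(titles, weights=weights, k=k)

if __name__ == "__main__":
    counts = {"Popular article": 9000, "Obscure article": 10}
    print(sample_by_pageviews(counts, k=5, seed=1))
```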

One flaw I found in my proposed methodology is that the "moment in time"
needs to be randomized, since certain times of the day/week/year might very
well experience higher vandalism than others.
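
Randomizing the moment could look something like the sketch below, which
picks a timestamp uniformly within a chosen window.  The window endpoints
and the function name are illustrative assumptions; a window covering at
least a full year would be needed to average out seasonal effects.

```python
import random
from datetime import datetime, timedelta, timezone

def random_moment(start, end, rng=random):
    """Return a datetime chosen uniformly at random between start and end."""
    span_seconds = (end - start).total_seconds()
    return start + timedelta(seconds=rng.uniform(0, span_seconds))

if __name__ == "__main__":
    window_start = datetime(2008, 8, 1, tzinfo=timezone.utc)
    window_end = datetime(2009, 8, 1, tzinfo=timezone.utc)
    print(random_moment(window_start, window_end))
```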


