[Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
Anthony
wikimail at inbox.org
Fri Aug 21 01:47:18 UTC 2009
On Thu, Aug 20, 2009 at 9:30 PM, Mark Wagner <carnildo at gmail.com> wrote:
> On Thu, Aug 20, 2009 at 14:10, Anthony<wikimail at inbox.org> wrote:
> > "if one chooses a random page from Wikipedia right now, what is the
> > probability of receiving a vandalized revision" The best way to answer
> > that question would be with a manually processed random sample taken
> > from a pre-chosen moment in time. As few as 1000 revisions would
> > probably be sufficient, if I know anything about statistics, but I'll
> > let someone with more knowledge of statistics verify or refute that.
> > The results will depend heavily on one's definition of "vandalism",
> > though.
>
> I did this in an informal fashion in 2005 during my "hundred article"
> surveys. Of the 503 pages I looked at, only one was clearly
> vandalized the first time I looked at it, so I'd say a thousand
> samples is probably too small to get any sort of precision on the
> vandalism rate.
Why? My understanding is that, if your methodology was correct, you can say
with 96% confidence that the percentage of vandalized articles is less than
0.6%. That's useful. With 1000 samples, if you found two instances of
vandalism, you'd have 97% confidence that the percentage of vandalized
articles is less than 0.5%.
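One way to check figures like these is an exact one-sided binomial
calculation. The sketch below assumes each sampled page is an independent
draw with the same chance of being vandalized. The function names are
made up for illustration. The numbers it prints may differ from the 96%
and 97% quoted above, depending on which approximation those came from.

```python
from math import comb


def prob_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def confidence_rate_below(k, n, p_max):
    """One-sided confidence that the true rate is below p_max,
    given k vandalized pages in a random sample of n.

    If the true rate were p_max or higher, seeing k or fewer hits
    would happen with probability at most prob_at_most(k, n, p_max).
    """
    return 1.0 - prob_at_most(k, n, p_max)


if __name__ == "__main__":
    # The two scenarios from the message: (hits, sample size, claimed bound).
    for k, n, p_max in [(1, 503, 0.006), (2, 1000, 0.005)]:
        conf = confidence_rate_below(k, n, p_max)
        print(f"{k} of {n} vandalized: confidence that rate < {p_max:.1%} is {conf:.1%}")
```

The same function can be used the other way around: try several values
of p_max to see which bound a given sample actually supports.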
I don't think it's that low, but if you publish the details of your "hundred
article" surveys, I might be persuaded that it is.
If we really do have that figure to that level of assurance, a more useful
statistic would be the percentage of pageviews that result in a vandalized
article. That could be arrived at by weighting by pageviews while choosing
your random sample.
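A minimal sketch of pageview-weighted sampling, assuming a mapping from
page title to pageview count is available for the period of interest.
The titles and counts below are invented for illustration. Pages are
drawn with replacement, since the unit being sampled is a page view
rather than a page.

```python
import random


def sample_weighted_by_pageviews(pageviews, k, seed=None):
    """Draw k page titles with probability proportional to pageviews.

    pageviews: dict mapping title -> non-negative view count (hypothetical data).
    Sampling is with replacement, so a popular page can appear several times.
    """
    rng = random.Random(seed)
    titles = list(pageviews)
    weights = [pageviews[t] for t in titles]
    return rng.choices(titles, weights=weights, k=k)


if __name__ == "__main__":
    # Invented counts, only to show the call.
    views = {"Main Page": 500000, "Statistics": 1200, "Obscure stub": 3}
    print(sample_weighted_by_pageviews(views, k=5, seed=42))
```

The fraction of sampled titles found to be vandalized then estimates the
share of page views that land on a vandalized page, not the share of
pages that are vandalized.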
One flaw I found in my proposed methodology is that the "moment in time"
needs to be randomized, since certain times of the day/week/year might very
well experience higher vandalism than others.
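Randomizing the moment could look something like the sketch below, which
draws a uniformly random timestamp from a chosen window. The window
itself (one calendar year here) is an assumption. So is drawing a
separate moment for each sampled page rather than one shared moment.

```python
import random
from datetime import datetime, timedelta, timezone


def random_moment(start, end, rng=None):
    """Return a uniformly random datetime between start and end."""
    if rng is None:
        rng = random.Random()
    span_seconds = (end - start).total_seconds()
    return start + timedelta(seconds=rng.uniform(0, span_seconds))


if __name__ == "__main__":
    # Hypothetical window: all of 2008, in UTC.
    start = datetime(2008, 1, 1, tzinfo=timezone.utc)
    end = datetime(2009, 1, 1, tzinfo=timezone.utc)
    rng = random.Random(0)
    for _ in range(3):
        print(random_moment(start, end, rng).isoformat())
```

Each sampled page would then be examined as it stood at its drawn
moment, so that time-of-day and seasonal effects average out over the
sample.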