[Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

Anthony wikimail at inbox.org
Fri Aug 21 00:02:53 UTC 2009


On Thu, Aug 20, 2009 at 7:54 PM, Thomas Dalton <thomas.dalton at gmail.com> wrote:

> 2009/8/21 Anthony <wikimail at inbox.org>:
> > "Is this article vandalized?" is a yes/no question...
>
> True, but that isn't actually the question that this research tried to
> answer. It tried to answer "How much time has this article spent in a
> vandalised state?".


"When one downloads a dump file, what percentage of the pages are
actually in a vandalized state?"

"This is equivalent to asking, if one chooses a random page from Wikipedia
right now, what is the probability of receiving a vandalized revision?"

That's the question I was referring to.


> If we are only interested in whether the most
> recent revision is vandalised then that is a simpler problem but would
> require a much larger sample to get the same quality of data.


How much larger?  Do you know anything about this, or are you just guessing?
The number of random samples needed for a high degree of confidence tends
to be much smaller than most people suspect.  That much I know.

I found one problem with my use of http://www.raosoft.com/samplesize.html

I was specifying a margin of error of 5%.  But that's an absolute margin of
error.  So if it were 0.2% vandalism, that'd be 0.2% plus or minus 5
percentage points.  Obviously unacceptable.
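
To put numbers on the absolute-versus-relative point, here is a small
sketch.  The function name and the clamping to the range 0..1 are my own
choices for illustration, not anything the calculator does.

```python
def absolute_interval(estimate, margin):
    """Return (low, high) for estimate +/- an absolute margin, clamped to [0, 1]."""
    low = max(0.0, estimate - margin)
    high = min(1.0, estimate + margin)
    return low, high


# A 0.2% vandalism estimate with a 5-percentage-point absolute margin.
low, high = absolute_interval(0.002, 0.05)

# The same margin expressed relative to the estimate itself.
relative_margin = 0.05 / 0.002

print(f"interval: {low:.1%} to {high:.1%}")
print(f"margin is about {relative_margin:.0f} times the estimate")
```

The interval swallows the estimate entirely, which is why the margin has to
be set well below the rate being measured.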

However, with the response distribution set to 0.2% instead, this would
still require 7649 samples for 95% confidence plus or minus 0.1%.  If the
vandalism turned out to be more prevalent though, and I suspect it would, we
could for instance be 95% confident plus or minus 0.5% if the response
distribution was 0.5% and we had 765 samples.
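
As a rough check on those numbers, here is a sketch of the usual sample-size
formula for a proportion, with a finite population correction.  Several
things here are assumptions, not facts stated in this thread: z = 1.96 for
95% confidence, a population of 3,000,000 pages, and that the calculator
uses a formula of this form.  The function and constant names are made up
for illustration.

```python
import math

Z_95 = 1.96  # conventional normal critical value for 95% confidence


def sample_size(p, margin, population=None, z=Z_95):
    """Samples needed to estimate a proportion p within +/- margin (absolute).

    If population is given, apply the finite population correction;
    otherwise treat the population as effectively infinite.
    """
    x = z * z * p * (1.0 - p)
    if population is None:
        return math.ceil(x / margin ** 2)
    return math.ceil(population * x / ((population - 1) * margin ** 2 + x))


ASSUMED_PAGES = 3_000_000  # assumption, not a figure from this thread

print(sample_size(0.002, 0.001, ASSUMED_PAGES))  # 0.2% +/- 0.1%
print(sample_size(0.005, 0.005, ASSUMED_PAGES))  # 0.5% +/- 0.5%
print(sample_size(0.002, 0.001))                 # same as first, no correction
```

Under these assumptions the first two calls give 7649 and 765, matching the
figures above.  Without the population correction the first case comes out
only slightly higher, at 7668, so the required sample depends far more on
the margin and the vandalism rate than on how many pages there are.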


