[Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
Anthony
wikimail at inbox.org
Fri Aug 21 00:02:53 UTC 2009
On Thu, Aug 20, 2009 at 7:54 PM, Thomas Dalton <thomas.dalton at gmail.com> wrote:
> 2009/8/21 Anthony <wikimail at inbox.org>:
> > "Is this article vandalized?" is a yes/no question...
>
> True, but that isn't actually the question that this research tried to
> answer. It tried to answer "How much time has this article spent in a
> vandalised state?".
"When one downloads a dump file, what percentage of the pages are
actually in a vandalized state?"
"This is equivalent to asking, if one chooses a random page from Wikipedia
right now, what is the probability of receiving a vandalized revision?"
That's the question I was referring to.
> If we are only interested in whether the most
> recent revision is vandalised then that is a simpler problem but would
> require a much larger sample to get the same quality of data.
How much larger? Do you know anything about this, or are you just guessing?
The number of random samples needed for a high degree of confidence tends
to be much, much smaller than most people suspect. That much I know.
I found one problem with my use of http://www.raosoft.com/samplesize.html:
I was specifying a margin of error of 5%. But that's an absolute margin of
error. So if it were 0.2% vandalism, that'd be 0.2% plus or minus 5
percentage points. Obviously unacceptable.
However, the response distribution would then be 0.2%. It would still take
7649 samples to be 95% confident to within plus or minus 0.1 percentage
points. If the vandalism turned out to be more prevalent, though, and I
suspect it would, we could for instance be 95% confident to within plus or
minus 0.5 percentage points if the response distribution was 0.5% and we
had 765 samples.
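
For anyone who wants to check those figures without the web form, here is a
rough sketch of the standard sample-size formula for a proportion, with a
finite population correction. It is not the calculator's actual code. The
function name sample_size and the population of 3,000,000 pages are
assumptions: the post does not say what population was entered, and
3,000,000 is simply a value that happens to reproduce both numbers.

```python
import math
from statistics import NormalDist


def sample_size(p, margin, confidence=0.95, population=None):
    """Samples needed to estimate a proportion p to within +/- margin.

    p and margin are fractions (0.002 means 0.2%); margin is absolute.
    population is optional; when given, a finite population correction
    is applied. All names here are illustrative, not from the post.
    """
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    x = z * z * p * (1 - p)
    if population is None:
        # Infinite-population form: n = z^2 * p * (1 - p) / margin^2
        return math.ceil(x / margin ** 2)
    # Finite-population form: n = N * x / ((N - 1) * margin^2 + x)
    return math.ceil(population * x / ((population - 1) * margin ** 2 + x))


# Assumed population size; not stated in the post.
ASSUMED_PAGES = 3_000_000

print(sample_size(0.002, 0.001, population=ASSUMED_PAGES))  # 7649
print(sample_size(0.005, 0.005, population=ASSUMED_PAGES))  # 765
```

Dropping the population argument gives 7668 and 765, so the correction
barely matters at this scale; what drives the sample size is the ratio of
p * (1 - p) to the square of the margin.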