On Thu, Aug 20, 2009 at 7:54 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/8/21 Anthony wikimail@inbox.org:
"Is this article vandalized?" is a yes/no question...
True, but that isn't actually the question that this research tried to answer. It tried to answer "How much time has this article spent in a vandalised state?".
"When one downloads a dump file, what percentage of the pages are actually in a vandalized state?"
"This is equivalent to asking, if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision?"
That's the question I was referring to.
If we are only interested in whether the most recent revision is vandalised, then that is a simpler problem, but it would require a much larger sample to get the same quality of data.
How much larger? Do you know anything about this, or are you just guessing? The number of random samples needed for a high degree of confidence tends to be much, much less than most people suspect. That much I know.
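For a yes/no proportion, the standard approximation is n ≈ z^2 * p * (1-p) / e^2, where p is the expected proportion and e is the absolute margin of error; the population size barely matters once it's large. At the worst case of p = 50%, with e = 5% and 95% confidence (z = 1.96), that works out to about 385 samples, whether the wiki has twenty thousand pages or three million.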
I found one problem with my use of http://www.raosoft.com/samplesize.html
I was specifying a margin of error of 5%. But that's an absolute margin of error, so if the true rate were 0.2% vandalism, that'd be 0.2% plus or minus 5 percentage points. Obviously unacceptable.
However, the response distribution would then be 0.2%. That would still require 7649 samples for 95% confidence plus or minus 0.1%. If vandalism turned out to be more prevalent, though, and I suspect it would, we could for instance be 95% confident plus or minus 0.5% with a response distribution of 0.5% and only 765 samples.
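Here's a quick Python sketch of what I believe Raosoft's calculator is computing: the standard sample-size formula with a finite population correction. Assuming z = 1.96 and a population of about 3,000,000 articles (roughly the English Wikipedia article count at the time; Raosoft's exact internals may differ slightly), it reproduces the figures above:

import math

def sample_size(population, response_pct, margin_pct, z=1.96):
    # Raosoft-style sample size with finite population correction.
    # response_pct and margin_pct are percentages (0.2 means 0.2%),
    # and the margin is absolute, not relative.
    x = z * z * response_pct * (100.0 - response_pct)
    # For very large populations this tends toward z^2 * p(1-p) / e^2.
    return math.ceil(population * x / ((population - 1) * margin_pct ** 2 + x))

# Assuming ~3,000,000 articles and z = 1.96 (both guesses on my part):
print(sample_size(3000000, 0.2, 0.1))  # -> 7649
print(sample_size(3000000, 0.5, 0.5))  # -> 765

Either way, the broad point holds: it's the tiny absolute margin, not the size of the wiki, that pushes the sample into the thousands.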