2009/8/21 Anthony <wikimail@inbox.org>:
>> If we are only interested in whether the most recent revision is vandalised, then that is a simpler problem, but it would require a much larger sample to get the same quality of data.
> How much larger? Do you know anything about this, or are you just guessing? The number of random samples needed for a high degree of confidence tends to be much, much smaller than most people suspect. That much I know.
I have a Master's degree in Mathematics, so I know a little about the subject. (I didn't study much statistics, but you can't do 4 years of Maths at Uni without getting some basic understanding of it.)
You say it requires 7649 articles, which sounds about right to me. If we looked through the entire history (or just the last year or six months, if you want only recent data) then we could do it with significantly fewer articles. I'm not sure how many we would need, though.

I think we need to know the distribution of how long a randomly chosen article spends in a vandalised state before we can work out what the distribution of the average would be. My statistics isn't good enough to work out even what kind of distribution it is likely to be, and I certainly can't guess at the parameters. It obviously ranges between 0% and 100%, with the mean somewhere close to 0% (0.4% seems like a good estimate), and it will presumably have a long tail (truncated at 100%). There are articles that spend their entire life in a vandalised state (attack pages, for example), and there is a chance we'll miss such a page completely and it will last the entire length of the survey period, so the probability density at 100% won't be 0. I'm sure there is a distribution that satisfies those requirements, but I don't know what it is.
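For context, here is a minimal sketch (in Python) of the standard sample-size formula for estimating a proportion, i.e. the "is the current revision vandalised?" approach. The confidence level, margin of error and vandalism rate below are my own illustrative assumptions, not whatever produced the 7649 figure, so the output isn't meant to reproduce that number:

    from math import ceil
    from statistics import NormalDist

    def sample_size_for_proportion(p, margin, confidence=0.95):
        """Articles needed to estimate a proportion p to within +/- margin,
        using the normal approximation and ignoring finite-population effects."""
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
        return ceil(z ** 2 * p * (1 - p) / margin ** 2)

    # Illustrative numbers only: if roughly 0.4% of current revisions are
    # vandalised and we want that proportion pinned down to +/-0.1
    # percentage points at 95% confidence, we need about 15,000 articles.
    print(sample_size_for_proportion(0.004, 0.001))

The required sample grows with the square of the precision you want, and each article contributes only a single yes/no observation, which is presumably why looking through each article's history (many observations per article) can get by with far fewer articles.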
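On the distribution question, here is a second sketch: a small Monte Carlo simulation of a made-up mixture chosen only to match the shape described above (a big spike at 0%, a long tail truncated at 100%, a mean near 0.4%, and a small amount of probability at exactly 100% for pages like undetected attack pages). All of the weights and parameters are assumptions of mine, and the percentile bootstrap at the end isn't something proposed in the thread, just one standard way to put a confidence interval on the mean without knowing the distribution's family or parameters:

    import random

    random.seed(0)

    def simulated_fraction_vandalised():
        """One article's fraction of its life spent vandalised, drawn from a
        made-up mixture: 90% never vandalised, 9.9% briefly vandalised
        (exponential tail with mean 3%), 0.1% vandalised for their whole life.
        The weights are chosen so the overall mean is roughly 0.4%."""
        r = random.random()
        if r < 0.90:
            return 0.0
        elif r < 0.999:
            return min(1.0, random.expovariate(1 / 0.03))  # truncate at 100%
        else:
            return 1.0

    def bootstrap_ci(sample, draws=2000, alpha=0.05):
        """Percentile-bootstrap confidence interval for the sample mean."""
        n = len(sample)
        means = sorted(
            sum(random.choice(sample) for _ in range(n)) / n
            for _ in range(draws)
        )
        return means[int(draws * alpha / 2)], means[int(draws * (1 - alpha / 2))]

    sample = [simulated_fraction_vandalised() for _ in range(1000)]
    mean = sum(sample) / len(sample)
    low, high = bootstrap_ci(sample)
    print(f"mean fraction vandalised: {mean:.4f}  (95% CI {low:.4f} to {high:.4f})")

With these made-up parameters the estimate should come out somewhere around 0.4%, and the width of the interval gives a feel for how well a survey of 1000 article histories would actually pin the average down.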