On Thu, Aug 20, 2009 at 1:55 PM, Nathan nawrich@gmail.com wrote:
My point (which might still be incorrect, of course) was that an analysis based on 30,000 randomly selected pages was more informative about the English Wikipedia than 100 articles about serving United States Senators.
Any automated method of finding vandalism is doomed to failure. I'd say its informativeness was precisely zero.
Greg's analysis, on the other hand, was informative, but it was targeted at a much different question than Robert's.
"if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision" The best way to answer that question would be with a manually processed random sample taken from a pre-chosen moment in time. As few as 1000 revisions would probably be sufficient, if I know anything about statistics, but I'll let someone with more knowledge of statistics verify or refute that. The results will depend heavily on one's definition of "vandalism", though.
On Thu, Aug 20, 2009 at 12:38 PM, Jimmy Wales jwales@wikia-inc.com wrote:
Is there a possibility of re-running the numbers to include traffic weightings?
That definitely should be done.
I would hypothesize from experience that if we adjust the "random page" selection to account for traffic (to get a better view of what people are actually seeing) we would see slightly different results.
I think we'd see drastically different results.
I think we would see a lot less vandalism (percentage-wise) that persists for a really long time, for precisely the reason you identified: most vandalism that lasts a long time does so because it is on obscure pages that no one is visiting.
Agreed. On the other hand, I think we'd also see that pages with more traffic are more likely to be vandalized.
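To make the difference between "random page" and "what people are actually seeing" concrete, here is a minimal sketch comparing a uniform estimate with a traffic-weighted one. The page list, view counts, and vandalism flags are invented purely for illustration, and the function names are hypothetical; the toy data is built so that the vandalized pages are the obscure ones, which is only one of the two effects discussed above.

```python
import random

# Invented toy data: NOT real Wikipedia measurements.
EXAMPLE_PAGES = [
    {"title": "Obscure A", "views": 10, "vandalized": True},
    {"title": "Obscure B", "views": 20, "vandalized": True},
    {"title": "Popular C", "views": 5000, "vandalized": False},
    {"title": "Popular D", "views": 4970, "vandalized": False},
]

def uniform_vandalism_rate(pages):
    """Probability that a uniformly chosen page is vandalized."""
    return sum(1 for p in pages if p["vandalized"]) / len(pages)

def traffic_weighted_vandalism_rate(pages):
    """Probability that a page view lands on a vandalized page."""
    total_views = sum(p["views"] for p in pages)
    bad_views = sum(p["views"] for p in pages if p["vandalized"])
    return bad_views / total_views

def sample_by_traffic(pages, k, seed=None):
    """Draw k pages with replacement, weighted by view count."""
    rng = random.Random(seed)
    weights = [p["views"] for p in pages]
    return rng.choices(pages, weights=weights, k=k)

if __name__ == "__main__":
    print(uniform_vandalism_rate(EXAMPLE_PAGES))
    print(traffic_weighted_vandalism_rate(EXAMPLE_PAGES))
```

With this made-up data the uniform rate is 0.5 while the traffic-weighted rate is 0.003. If high-traffic pages turn out to be vandalized more often, the weighting could just as easily push the estimate the other way; the sketch only shows that the two sampling schemes answer different questions.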
Of course, this assumes a valid methodology. Using "admin rollback, the undo function, the revert bots, various editing tools, and commonly used phrases like 'rv', 'rvv', etc." to find vandalism is heavily skewed toward vandalism that doesn't last very long (or at least doesn't last very many edits). It's basically useless.