Robert Rohde wrote:
When one downloads a dump file, what percentage of the pages are actually in a vandalized state?
This is equivalent to asking, if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision?
Would it be possible to re-run the numbers with traffic weighting included?
I would hypothesize from experience that if we adjust the "random page" selection to account for traffic (to get a better view of what people are actually seeing) we would see slightly different results.
I think we would see a much smaller share of vandalism that persists for a really long time, for precisely the reason you identified: most long-lived vandalism survives because it sits on obscure pages that no one is visiting. That doesn't mean it is not a problem, but it does change some thinking about what kinds of tools are needed to deal with it.
I'm not sure what else would change.
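
To make the traffic-weighting idea concrete, here is a rough sketch of the two ways of picking a "random page". The function name, the (views, is_vandalized) tuple layout, and all of the numbers are made up for illustration; they are not taken from the actual analysis or from any dump format.

```python
def vandalized_fraction(pages, weighted=False):
    """Chance that a randomly chosen page (or page view) is vandalized.

    pages: list of (daily_views, is_vandalized) tuples. Hypothetical layout.
    weighted=False picks every page with equal probability (the dump-file view).
    weighted=True picks pages in proportion to their traffic (the reader's view).
    """
    if weighted:
        total = sum(views for views, _ in pages)
        hit = sum(views for views, bad in pages if bad)
    else:
        total = len(pages)
        hit = sum(1 for _, bad in pages if bad)
    return hit / total if total else 0.0


# Invented numbers: one busy clean page and three obscure pages,
# one of which is sitting in a vandalized state.
pages = [(10000, False), (50, False), (5, True), (5, False)]

uniform = vandalized_fraction(pages)                   # per-page probability
by_traffic = vandalized_fraction(pages, weighted=True) # per-view probability
print(uniform, by_traffic)
```

With these invented inputs the per-page figure is far higher than the per-view figure, which is the direction I would expect the real data to move if vandalism really does linger mostly on low-traffic pages. How large the difference is in practice depends entirely on the real traffic distribution.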