---------- Forwarded message ----------
From: Robert Rohde <rarohde@gmail.com>
Date: Thu, Aug 20, 2009 at 11:06 AM
Subject: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
To: Wikimedia Foundation Mailing List <foundation-l@lists.wikimedia.org>, English Wikipedia <wikien-l@lists.wikimedia.org>
Cc: Sean Moss-Pultz <sean@openmoko.com>, suh@parc.com

I am supposed to be taking a wiki-vacation to finish my PhD thesis and find a job for next year. However, this afternoon I decided to take a break and consider an interesting question recently suggested to me by someone else:
When one downloads a dump file, what percentage of the pages are actually in a vandalized state?
This is equivalent to asking, if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision?
Understanding what fraction of Wikipedia is vandalized at any given instant is obviously of both practical and public relations interest. In addition, it bears on the motivation for certain development projects like flagged revisions. So, I decided to generate a rough estimate.
For the purposes of making an estimate I used the main namespace of the English Wikipedia and adopted the following operational approximations: I considered that "vandalism" is that thing which gets reverted, and that "reverts" are those edits tagged with "revert, rv, undo, undid, etc." in the edit summary line. Obviously, not all vandalism is cleanly reverted, and not all reverts are cleanly tagged. In addition, some things flagged as reverts aren't really addressing what we would conventionally consider to be vandalism. Such caveats notwithstanding, I have had some reasonable success with using a revert heuristic in the past. With the right keywords one can easily catch the standardized comments created by admin rollback, the undo function, the revert bots, various editing tools, and commonly used phrases like "rv", "rvv", etc. It won't be perfect, but it is a quick way of getting an automated estimate. I would usually expect the answer I get in this way to be correct within an order of magnitude, and perhaps within a factor of a few, though it is still just a crude estimate.
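As a rough illustration of the keyword heuristic described above (not the code actually used for this analysis), a revert detector might look like the sketch below. The keyword list and the name `looks_like_revert` are assumptions made for the example; the real patterns may well have differed.

```python
import re

# Hypothetical keyword list, chosen for illustration only.
# \b word boundaries keep "rv" from matching inside words like "overview".
REVERT_PATTERN = re.compile(
    r"\b(?:rvv?|revert(?:ed|ing)?|undid|undo|rollback)\b",
    re.IGNORECASE,
)

def looks_like_revert(summary):
    """Return True if an edit summary contains a revert-style keyword."""
    return bool(REVERT_PATTERN.search(summary or ""))
```

A summary such as "Undid revision 123 by Example" or "rvv" would be flagged, while "added overview section" would not.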
I analyzed the edit history up to the mid-June dump for a sample of 29,999 main namespace pages (sampling from everything in main including redirects). This included 1,333,829 edits, from which I identified 102,926 episodes of reverted "vandalism". As a further approximation, I assumed that whenever a revert occurred, it applied to the immediately preceding edit and any additional consecutive changes by the same editor (this is how admin rollback operates, but is not necessarily true of tools like undo).
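The rollback-style assumption above can be sketched as follows. This is only an illustration under stated assumptions: the `Edit` record, its fields, and `vandalism_intervals` are hypothetical names, and the input is assumed to be one page's history sorted oldest first.

```python
from typing import List, NamedTuple, Tuple

class Edit(NamedTuple):
    timestamp: float   # seconds since some epoch
    editor: str
    is_revert: bool    # e.g. decided by an edit-summary heuristic

def vandalism_intervals(edits: List[Edit]) -> List[Tuple[float, float]]:
    """Return (start, end) times of each reverted episode on one page.

    Each revert is assumed to undo the immediately preceding edit plus any
    consecutive earlier edits by that same editor.
    """
    intervals = []
    for i, edit in enumerate(edits):
        if not edit.is_revert or i == 0:
            continue  # nothing precedes the first edit, so nothing to undo
        j = i - 1
        culprit = edits[j].editor
        # Walk back over the run of consecutive edits by the same editor.
        while j > 0 and edits[j - 1].editor == culprit:
            j -= 1
        intervals.append((edits[j].timestamp, edit.timestamp))
    return intervals
```

For a history of a good edit at t=0, two edits by the same editor at t=100 and t=160, and a revert at t=400, this yields a single interval from 100 to 400.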
With those assumptions, I then used the timestamps on my identified intervals of vandalism to figure out how much time each page had spent in a vandalized state. Over the entire history of Wikipedia, this sample of pages was vandalized during 0.28% of its existence. Or, more relevantly, focusing on just this year vandalism was present 0.21% of the time, which suggests that one should expect 0.21% of mainspace pages in any recent enwiki dump will be in a vandalized state (i.e. 1 in 480).
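The time-in-state calculation can be sketched like this, pooling all sampled pages together. The function name and input layout are assumptions for illustration, and the sketch assumes the vandalism intervals on a page do not overlap.

```python
def pooled_vandalized_fraction(pages, snapshot_time):
    """Fraction of total page-lifetime spent in a vandalized state.

    pages: iterable of (created_time, intervals), where intervals is a
    list of non-overlapping (start, end) vandalism periods for that page.
    """
    total_lifetime = 0.0
    total_vandalized = 0.0
    for created, intervals in pages:
        total_lifetime += max(0.0, snapshot_time - created)
        total_vandalized += sum(end - start for start, end in intervals)
    return total_vandalized / total_lifetime if total_lifetime else 0.0

def one_in(fraction):
    """Express a fraction as 'one in N' odds."""
    return 1.0 / fraction
```

For the 0.21% figure, `one_in(0.0021)` gives about 476, which rounds to the "1 in 480" quoted above.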
(Note that since redirects represent 55% of the main namespace and are rarely vandalized, one could argue that 0.37% [1 in 270] would be a better estimate for the portion of actual articles that are in a vandalized condition at any given moment.)
I also took a look at the time distribution of vandalism. Not surprisingly, it has a very long tail. The median time to revert over the entire history is 6.7 minutes, but the mean time to revert is 18.2 hours, and my sample included one revert going back 45 months (though examples of such very long lags also imply the page had gone years without any edits, which would imply an obscure topic that was also almost never visited). In the recent period these figures become 5.2 minutes and 14.4 hours for the median and mean respectively. The observation that nearly 50% of reverts are occurring in 5 minutes or less is a testament to the efficient work of recent changes reviewers and watchlists.
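To see why the median and mean can differ so sharply in a long-tailed distribution, consider the small sketch below. The durations are made-up numbers for illustration only, not data from this analysis, and `summarize` is a hypothetical helper.

```python
import statistics

def summarize(durations):
    """Return (median, mean) of a list of revert lags."""
    return statistics.median(durations), statistics.mean(durations)

# Four quick reverts and one that took far longer (illustrative minutes).
median_lag, mean_lag = summarize([1, 2, 3, 4, 1000])
```

Here the median is 3 minutes while the mean is 202 minutes: a single slow revert dominates the mean but barely moves the median.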
Unfortunately the 5% of vandalism that persists longer than 35 hours is responsible for 90% of the actual vandalism a visitor is likely to encounter at random. Hence, as one might guess, it is the vandalism that slips through and persists the longest that has the largest practical effect.
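The point that a small share of episodes accounts for most of what a random visitor sees follows from weighting each episode by its duration. A minimal sketch, with a hypothetical function name and made-up durations rather than the real data:

```python
def exposure_share(durations, threshold):
    """Share of total vandalized time due to episodes longer than threshold."""
    total = sum(durations)
    if total == 0:
        return 0.0
    return sum(d for d in durations if d > threshold) / total

# Illustrative only: 19 one-hour episodes and a single 171-hour episode.
durations = [1] * 19 + [171]
share = exposure_share(durations, threshold=35)
```

In this toy example one episode in twenty (5%) lasts longer than 35 hours, yet it contributes 171 of the 190 total hours, or 90% of the vandalized time.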
It is also worth noting that the prevalence figures for February-May of this year are slightly lower than at any time since 2006. There is also a drop in the mean duration of vandalism coupled to a slight increase in the median duration. However, these effects mostly disappear if we limit our considerations to only vandalism events lasting 1 month or shorter. Hence those changes may be in significant part linked to cut-off biasing from longer-term vandalism events that have yet to be identified. The ambiguity in the change from earlier in the year is somewhat surprising as the AbuseFilter was launched in March and was intended to decrease the burden of vandalism. One might speculate that the simple vandalism amenable to the AbuseFilter was already being addressed quickly in nearly all cases and hence its impact on the persistence of vandalism may already have been fairly limited.
I've posted some summary data on the wiki at:
http://en.wikipedia.org/wiki/Wikipedia:Vandalism_statistics
Given the nature of the approximations I made in doing this analysis I suspect it is more likely that I have somewhat underestimated the vandalism problem rather than overestimated it, but as I said in the beginning I'd like to believe I am in the right ballpark. If that's true, I personally think that having less than 0.5% of Wikipedia be vandalized at any given instant is actually rather comforting. It's not a perfect number, but it would suggest that nearly everyone still gets to see Wikipedia as intended rather than in a vandalized state. (Though to be fair I didn't try to figure out if the vandalism occurred in more frequently visited parts or not.)
Unfortunately, that's it for now as I need to get back to my thesis / job search.
-Robert Rohde