Very interesting research - many thanks for sharing it.
----- "Robert Rohde" rarohde@gmail.com wrote:
From: "Robert Rohde" rarohde@gmail.com
To: "Wikimedia Foundation Mailing List" foundation-l@lists.wikimedia.org
Sent: Thursday, 27 August, 2009 17:41:29 GMT +00:00 GMT Britain, Ireland, Portugal
Subject: [Foundation-l] Frequency of Seeing Bad Versions - now with traffic data
Recently, I reported on a simple study of how likely one was to encounter recent vandalism in Wikipedia based on selecting articles at random and using revert behavior as a proxy for recent vandalism.
http://lists.wikimedia.org/pipermail/foundation-l/2009-August/054171.html
One of the key limitations of that work was that it was looking at articles selected at random from the pool of all existing page titles. That approach was of the most immediate interest to me, but it didn't directly address the likelihood of encountering vandalism based on the way that Wikipedia is actually used because the selection of articles that people choose to visit is highly non-random.
I've now redone that analysis with a crude traffic-based weighting. For traffic information I used the same data stream used by http://stats.grok.se. That data is recorded hourly. For simplicity I chose 20 hours at random from the last eight months and averaged those together to get a rough picture of the relative prominence of pages. I then chose a selection of 30000 articles at random, with their probability of selection proportional to the traffic they received, and repeated the analysis previously described. (Note that this has the effect of treating the prominence of each page as a constant over time. In practice we know some pages rise to prominence while others fall, but I am assuming the average pattern is still a good enough approximation to be useful.)
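As a rough illustration of the sampling step described above (not the script actually used for this analysis), the Python sketch below averages a few hourly snapshots and then draws pages with probability proportional to their average traffic. The function names and the input layout (one dict of page title to hit count per hour) are assumptions made for the example, and it samples with replacement, which the post does not specify.

```python
import random
from collections import Counter

def average_traffic(hourly_counts):
    """Average per-page hits over a list of hourly snapshots.

    hourly_counts: list of dicts mapping page title -> hits in that hour
    (a hypothetical layout; the real data stream may differ).
    """
    totals = Counter()
    for snapshot in hourly_counts:
        totals.update(snapshot)  # adds counts page by page
    n = len(hourly_counts)
    return {page: hits / n for page, hits in totals.items()}

def sample_by_traffic(traffic, k, seed=0):
    """Draw k pages with probability proportional to average traffic.

    Assumption: sampling is done with replacement, so a very popular
    page can appear more than once in the sample.
    """
    rng = random.Random(seed)
    pages = list(traffic)
    weights = [traffic[p] for p in pages]
    return rng.choices(pages, weights=weights, k=k)
```

Treating the averaged counts as fixed weights is what makes each page's prominence constant over time, as the parenthetical note above says.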
From this sample I found 5,955,236 revert events in 38,096,653 edits. This is 29 times the edit frequency and 58 times the number of revert events that were found from a uniform sampling of pages. I suspect it surprises no one that highly trafficked pages are edited more often and subject to more vandalism than the average page, though it might not have been obvious that the ratio of reverts to normal edits is also higher than on more obscure pages.
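To make the ratio claim concrete, here is a small arithmetic sketch in Python using only the figures quoted above. The variable names are illustrative, and the relative change is derived from the rounded multipliers (29 and 58), so it is approximate.

```python
# Figures quoted in the post for the traffic-weighted sample.
reverts = 5_955_236
edits = 38_096_653

# Share of edits that are revert events in the weighted sample.
revert_share = reverts / edits

# Relative to uniform sampling, reverts grew 58x while edits grew 29x,
# so the reverts-per-edit ratio grew by roughly 58 / 29.
ratio_increase = 58 / 29

# Implied share under uniform sampling (approximate, rounded inputs).
implied_uniform_share = revert_share / ratio_increase

print(f"weighted revert share: {revert_share:.1%}")
print(f"ratio increase:        {ratio_increase:.1f}x")
print(f"implied uniform share: {implied_uniform_share:.1%}")
```

In other words, roughly 15.6% of edits in the traffic-weighted sample are revert events, about double the share implied for the uniform sample.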
As before, the revert time distribution has a very long tail, though as predicted the times are generally reduced when traffic weighting is applied. In the traffic weighted sample, the median time to revert is 3.4 minutes and the mean time is 2.2 hours (compared to 6.7 minutes and 18.2 hours with uniform weighting). Again, I think it is worth acknowledging that having a majority of reverts occur within only a few minutes is a strong testament to the efficiency and dedication with which new edits are usually reviewed by the community. We could be much worse off if most things weren't caught so quickly.
Unfortunately, in comparing the current analysis to the previous one, the faster response time is essentially overwhelmed by the much larger number of vandalism occurrences. The net result is that, averaged over the whole history of Wikipedia, a visitor would be expected to receive a recently degraded article version during about 1.1% of requests (compared to ~0.37% in the uniform weighting estimate). The last six months averaged a slightly higher 1.3% (1 in 80 requests). As before, most of the degraded content that people are likely to actually encounter comes from the subset of things that get by the initial monitors and survive for a long time. Among edits that are eventually reverted, the longest-lasting 5% of bad content (those edits taking > 7.2 hours to revert) is responsible for 78% of the expected encounters with recently degraded material. One might speculate that such long-lived material is more likely to reflect subtle damage to a page rather than more obvious problems like page blanking. I did not try to investigate this.
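The long-tail effect can be sketched as follows. This is a simplified illustration, not the method used for the figures above: it assumes each reverted edit is described by how long it survived and by a constant hourly view rate for its page, and it takes expected encounters to be the product of the two. The function name and input layout are hypothetical.

```python
def tail_exposure_share(events, tail_fraction=0.05):
    """Share of expected encounters due to the longest-lived bad edits.

    events: list of (hours_until_reverted, views_per_hour) pairs
    (a hypothetical layout). Exposure for one event is assumed to be
    duration * view rate, i.e. page traffic is constant over time.
    """
    if not events:
        raise ValueError("need at least one revert event")
    # Longest-lived edits first.
    ordered = sorted(events, key=lambda e: e[0], reverse=True)
    n_tail = max(1, round(len(ordered) * tail_fraction))
    exposure = [hours * rate for hours, rate in ordered]
    return sum(exposure[:n_tail]) / sum(exposure)

# Toy data: 19 edits reverted in 3 minutes, one that survived 10 hours.
toy = [(0.05, 1.0)] * 19 + [(10.0, 1.0)]
print(f"{tail_exposure_share(toy):.0%} of exposure from the slowest 5%")
```

Even in this toy case a single slow revert dominates the total, which is the same qualitative pattern as the 5% / 78% split reported above.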
In my sample, the number of reverts being made to articles has declined ~40% since a peak in late 2006. However, the mean and median times to revert are little changed over the last two years. What little trend exists points in the direction of slightly slower responses.
So to summarize, the results here are qualitatively similar to those found in the previous work. However, with traffic weighting we find quantitative differences: reverts occur much more often but take less time to be executed. The net effect of these competing factors is that bad content is more likely to be seen than the uniform weighting suggested.
-Robert Rohde