[Foundation-l] Frequency of Seeing Bad Versions - now with traffic data

Andrew Turvey andrewrturvey at googlemail.com
Thu Aug 27 16:56:47 UTC 2009


Very interesting research - many thanks for sharing that. 

----- "Robert Rohde" <rarohde at gmail.com> wrote: 
> From: "Robert Rohde" <rarohde at gmail.com> 
> To: "Wikimedia Foundation Mailing List" <foundation-l at lists.wikimedia.org> 
> Sent: Thursday, 27 August, 2009 17:41:29 GMT +00:00 GMT Britain, Ireland, Portugal 
> Subject: [Foundation-l] Frequency of Seeing Bad Versions - now with traffic data 
> 
> Recently, I reported on a simple study of how likely one was to 
> encounter recent vandalism in Wikipedia based on selecting articles at 
> random and using revert behavior as a proxy for recent vandalism. 
> 
> http://lists.wikimedia.org/pipermail/foundation-l/2009-August/054171.html 
> 
> One of the key limitations of that work was that it was looking at 
> articles selected at random from the pool of all existing page titles. 
> That approach was of the most immediate interest to me, but it didn't 
> directly address the likelihood of encountering vandalism based on the 
> way that Wikipedia is actually used because the selection of articles 
> that people choose to visit is highly non-random. 
> 
> I've now redone that analysis with a crude traffic-based weighting. 
> For traffic information I used the same data stream used by 
> http://stats.grok.se. That data is recorded hourly. For simplicity I 
> chose 20 hours at random from the last eight months and averaged those 
> together to get a rough picture of the relative prominence of pages. 
> I then chose a selection of 30000 articles at random with their 
> probability of selection proportional to the traffic they received, 
> and repeated the analysis described previously. (Note that this 
> has the effect of treating the prominence of each page as a constant 
> over time. In practice we know some pages rise to prominence while 
> others fall, but I am assuming the average pattern is still a good 
> enough approximation to be useful.) 
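
A minimal sketch of the traffic weighting described above, not the code used for the study. The function names and the input layout (one dict of page title to view count per sampled hour) are hypothetical, and the sketch assumes draws are made with replacement, which the post does not specify.

```python
import random
from collections import Counter


def average_traffic(hourly_counts):
    """Average per-page view counts across the sampled hours.

    hourly_counts: list of dicts mapping page title -> views in that hour.
    A page missing from an hour is treated as having zero views then.
    """
    totals = Counter()
    for hour in hourly_counts:
        totals.update(hour)  # adds the counts of this hour to the totals
    n_hours = len(hourly_counts)
    return {title: views / n_hours for title, views in totals.items()}


def traffic_weighted_sample(avg_views, k, rng=random):
    """Draw k titles with probability proportional to average views.

    Assumption: sampling is with replacement, so a very popular page
    can be drawn more than once.
    """
    titles = list(avg_views)
    weights = [avg_views[title] for title in titles]
    return rng.choices(titles, weights=weights, k=k)
```

With 20 hourly dicts as input, traffic_weighted_sample(average_traffic(hours), 30000) would give a sample of the size used in the post.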
> 
> From this sample I found 5,955,236 revert events in 38,096,653 edits. 
> This is a 29-fold increase in edit frequency and a 58-fold increase in 
> the number of revert events relative to a uniform sampling of pages. 
> I suspect it surprises no one that highly trafficked pages are 
> edited more often and subject to more vandalism than the average page, 
> though it might not have been obvious that the ratio of reverts to 
> normal edits is also higher than on more obscure pages. 
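
The arithmetic behind the last point can be checked with a short sketch. The two counts are from the post; the uniform-sample ratio is only implied by the rounded 29x and 58x multipliers, so it is approximate, and the variable names are made up for illustration.

```python
# Counts reported for the traffic-weighted sample.
reverts_weighted = 5_955_236
edits_weighted = 38_096_653

# Share of edits in the weighted sample that are revert events.
ratio_weighted = reverts_weighted / edits_weighted

# The uniform sample is described only through rounded multipliers,
# so the values derived from them are approximations.
edit_multiplier = 29
revert_multiplier = 58
ratio_uniform = (reverts_weighted / revert_multiplier) / (
    edits_weighted / edit_multiplier
)

print(f"traffic-weighted: {ratio_weighted:.1%}")  # 15.6%
print(f"uniform (implied): {ratio_uniform:.1%}")  # 7.8%
```

Because reverts grew by twice the factor that edits did, the revert share roughly doubles under traffic weighting.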
> 
> As before, the revert time distribution has a very long tail, though 
> as predicted the times are generally reduced when traffic weighting is 
> applied. In the traffic weighted sample, the median time to revert is 
> 3.4 minutes and the mean time is 2.2 hours (compared to 6.7 minutes 
> and 18.2 hours with uniform weighting). Again, I think it is worth 
> acknowledging that having a majority of reverts occur within only a 
> few minutes is a strong testament to the efficiency and dedication 
> with which new edits are usually reviewed by the community. We could 
> be much worse off if most things weren't caught so quickly. 
> 
> Unfortunately, in comparing the current analysis to the previous one, 
> the faster response time is essentially being overwhelmed by the much 
> larger number of vandalism occurrences. The net result is that 
> averaged over the whole history of Wikipedia a visitor would be 
> expected to receive a recently degraded article version during about 
> 1.1% of requests (compared to ~0.37% in the uniform weighting 
> estimate). The last six months averaged a slightly higher 1.3% (1 in 
> 80 requests). As before, most of the degraded content that people are 
> likely to actually encounter is coming from the subset of things that 
> get by the initial monitors and survive for a long time. Among edits 
> that are eventually reverted, the longest-lasting 5% of bad content 
> (those edits taking > 7.2 hours to revert) is responsible for 78% of 
> the expected encounters with recently degraded material. One might 
> speculate that such long-lived material is more likely to reflect 
> subtle damage to a page rather than more obvious problems like page 
> blanking. I did not try to investigate this. 
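
One way to picture the two estimates in this paragraph is sketched below. This is an illustration under stated assumptions, not the method used in the post: each page has a constant view rate, the intervals during which a later-reverted revision was live do not overlap, and all function names and input layouts are hypothetical.

```python
def degraded_request_fraction(pages, window_hours):
    """Fraction of requests served while a later-reverted revision was live.

    pages: list of (views_per_hour, [hours each bad revision stayed live]).
    window_hours: length of the observation window, the same for every page.
    """
    bad_views = sum(rate * sum(durations) for rate, durations in pages)
    total_views = sum(rate * window_hours for rate, _ in pages)
    return bad_views / total_views


def longest_lived_share(bad_edits, top_fraction=0.05):
    """Share of expected bad views due to the longest-lived bad edits.

    bad_edits: list of (hours_live, views_per_hour), one per reverted edit.
    Edits are ranked by how long they survived, not by how often they
    were seen.
    """
    ranked = sorted(bad_edits, key=lambda edit: edit[0], reverse=True)
    cutoff = max(1, round(len(ranked) * top_fraction))
    exposure = [hours * rate for hours, rate in ranked]
    return sum(exposure[:cutoff]) / sum(exposure)
```

With these definitions, a small number of long-lived edits can dominate the total even when nearly all reverts happen within minutes, which is the pattern the paragraph describes.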
> 
> In my sample, the number of reverts being made to articles has 
> declined ~40% since a peak in late 2006. However, the mean and median 
> times to revert are little changed over the last two years. What little 
> trend exists points in the direction of slightly slower responses. 
> 
> 
> So to summarize, the results here are qualitatively similar to those 
> found in the previous work. However, with traffic weighting we find 
> quantitative differences: reverts occur much more often but take less 
> time to be executed. The net effect of these competing factors is 
> that bad content is more likely to be seen than the uniform weighting 
> suggested. 
> 
> -Robert Rohde 