[Foundation-l] Frequency of Seeing Bad Versions - now with traffic data
Andrew Turvey
andrewrturvey at googlemail.com
Thu Aug 27 16:56:47 UTC 2009
Very interesting research - many thanks for sharing it.
----- "Robert Rohde" <rarohde at gmail.com> wrote:
> From: "Robert Rohde" <rarohde at gmail.com>
> To: "Wikimedia Foundation Mailing List" <foundation-l at lists.wikimedia.org>
> Sent: Thursday, 27 August, 2009 17:41:29 GMT +00:00 GMT Britain, Ireland, Portugal
> Subject: [Foundation-l] Frequency of Seeing Bad Versions - now with traffic data
>
> Recently, I reported on a simple study of how likely one was to
> encounter recent vandalism in Wikipedia based on selecting articles at
> random and using revert behavior as a proxy for recent vandalism.
>
> http://lists.wikimedia.org/pipermail/foundation-l/2009-August/054171.html
>
> One of the key limitations of that work was that it was looking at
> articles selected at random from the pool of all existing page titles.
> That approach was of the most immediate interest to me, but it didn't
> directly address the likelihood of encountering vandalism based on the
> way that Wikipedia is actually used because the selection of articles
> that people choose to visit is highly non-random.
>
> I've now redone that analysis with a crude traffic-based weighting.
> For traffic information I used the same data stream used by
> http://stats.grok.se. That data is recorded hourly. For simplicity I
> chose 20 hours at random from the last eight months and averaged those
> together to get a rough picture of the relative prominence of pages.
> I then chose a selection of 30000 articles at random with their
> probability of selection proportional to the traffic they received,
> and repeated the analysis previously described. (Note that this
> has the effect of treating the prominence of each page as a constant
> over time. In practice we know some pages rise to prominence while
> others fall, but I am assuming the average pattern is still a good
> enough approximation to be useful.)
>
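A minimal sketch of the traffic-weighted selection described above, for readers who want to reproduce it. The function names, the toy page titles and the view counts are all hypothetical, and drawing with replacement is an assumption: the post does not say how the 30000 articles were drawn.

```python
import random
from collections import defaultdict


def average_hourly_views(hourly_samples):
    """Average per-page view counts over a list of hourly snapshots.

    Each snapshot is a dict {page_title: views_in_that_hour}. A page that is
    missing from a snapshot counts as zero views for that hour.
    """
    totals = defaultdict(int)
    for snapshot in hourly_samples:
        for title, views in snapshot.items():
            totals[title] += views
    n_hours = len(hourly_samples)
    return {title: total / n_hours for title, total in totals.items()}


def sample_pages_by_traffic(avg_views, k, seed=None):
    """Draw k titles with probability proportional to average traffic.

    Assumption: draws are made with replacement (not stated in the post).
    """
    rng = random.Random(seed)
    titles = list(avg_views)
    weights = [avg_views[title] for title in titles]
    return rng.choices(titles, weights=weights, k=k)


# Toy data standing in for the hourly traffic stream (hypothetical numbers).
hours = [
    {"Main_Page": 900, "Barack_Obama": 90, "Obscure_stub": 1},
    {"Main_Page": 1100, "Barack_Obama": 110},
]
avg = average_hourly_views(hours)
sample = sample_pages_by_traffic(avg, k=1000, seed=0)
```

Averaging a handful of hours and then holding the weights fixed is what treats each page's prominence as constant over time, as the note in the quoted paragraph says.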
> From this sample I found 5,955,236 revert events in 38,096,653 edits.
> This is a 29-fold increase in edit frequency and a 58-fold increase
> in the number of revert events relative to a uniform sampling of
> pages. I suspect it surprises no one that highly trafficked pages are
> edited more often and subject to more vandalism than the average page,
> though it might not have been obvious that the ratio of reverts to
> normal edits is also increased over more obscure pages.
>
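The arithmetic behind the revert-to-edit ratio can be checked directly from the figures quoted above. This is only a sketch: the variable names are mine, and the implied uniform-sample ratio assumes the rounded factors 29 and 58 were measured on the same basis.

```python
# Figures quoted in the post for the traffic-weighted sample.
reverts = 5_955_236
edits = 38_096_653

# Share of edits in the weighted sample that are revert events.
weighted_ratio = reverts / edits

# Rounded factors from the post: 29x the edits, 58x the reverts.
edit_factor = 29
revert_factor = 58

# If reverts grew 58x while edits grew 29x, the ratio grew 58/29 = 2x.
ratio_change = revert_factor / edit_factor

# Implied ratio for the uniform sample (an estimate, given the rounding).
implied_uniform_ratio = weighted_ratio / ratio_change

print(f"weighted sample: {weighted_ratio:.1%} of edits are reverts")
print(f"implied uniform sample: {implied_uniform_ratio:.1%}")
```

So roughly 15.6% of edits in the traffic-weighted sample are revert events, about twice the share implied for the uniform sample.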
> As before, the revert time distribution has a very long tail, though
> as predicted the times are generally reduced when traffic weighting is
> applied. In the traffic-weighted sample, the median time to revert is
> 3.4 minutes and the mean time is 2.2 hours (compared to 6.7 minutes
> and 18.2 hours with uniform weighting). Again, I think it is worth
> acknowledging that having a majority of reverts occur within only a
> few minutes is a strong testament to the efficiency and dedication
> with which new edits are usually reviewed by the community. We could
> be much worse off if most things weren't caught so quickly.
>
> Unfortunately, in comparing the current analysis to the previous one,
> the faster response time is essentially being overwhelmed by the much
> larger number of vandalism occurrences. The net result is that
> averaged over the whole history of Wikipedia a visitor would be
> expected to receive a recently degraded article version during about
> 1.1% of requests (compared to ~0.37% in the uniform weighting
> estimate). The last six months averaged a slightly higher 1.3% (1 in
> 80 requests). As before, most of the degraded content that people are
> likely to actually encounter is coming from the subset of things that
> get by the initial monitors and survive for a long time. Among edits
> that are eventually reverted, the longest-lasting 5% of bad content
> (those edits taking > 7.2 hours to revert) is responsible for 78% of
> the expected encounters with recently degraded material. One might
> speculate that such long-lived material is more likely to reflect
> subtle damage to a page rather than more obvious problems like page
> blanking. I did not try to investigate this.
>
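To see why a small, slow tail can dominate what readers actually encounter, here is a sketch of the exposure calculation. It assumes that expected encounters with a bad version scale as (time to revert) x (page views per hour), in line with the constant-prominence assumption in the post. The function name and the toy numbers are hypothetical and are not the data from the study.

```python
def exposure_share_of_slowest(events, top_fraction=0.05):
    """Share of total exposure caused by the slowest-reverted edits.

    events: list of (hours_until_revert, page_views_per_hour) tuples.
    Exposure of one event is assumed to be hours * views_per_hour.
    """
    if not events:
        raise ValueError("need at least one revert event")
    # Longest-lived bad versions first.
    ordered = sorted(events, key=lambda e: e[0], reverse=True)
    n_top = max(1, round(len(ordered) * top_fraction))
    exposure = [hours * rate for hours, rate in ordered]
    total = sum(exposure)
    if total == 0:
        raise ValueError("total exposure is zero")
    return sum(exposure[:n_top]) / total


# Toy example: 19 edits reverted in 3 minutes, one that survived 10 hours,
# all on pages receiving 100 views per hour.
toy_events = [(0.05, 100)] * 19 + [(10.0, 100)]
share = exposure_share_of_slowest(toy_events, top_fraction=0.05)
print(f"slowest 5% account for {share:.0%} of exposure")
```

In this toy case the single slow revert is responsible for about 91% of the exposure, even though 95% of the bad edits were fixed within minutes.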
> In my sample, the number of reverts being made to articles has
> declined ~40% since a peak in late 2006. However, the mean and median
> times to revert are little changed over the last two years. What little
> trend exists points in the direction of slightly slower responses.
>
>
> So to summarize, the results here are qualitatively similar to those
> found in the previous work. However, with traffic weighting we find
> quantitative differences such that reverts occur much more often but
> take less time to be executed. The net effect of these competing
> factors is such that the bad content is more likely to be seen than
> suggested by the uniform weighting.
>
> -Robert Rohde