[Foundation-l] Frequency of Seeing Bad Versions - now with traffic data

Robert Rohde rarohde at gmail.com
Thu Aug 27 16:41:29 UTC 2009


Recently, I reported on a simple study of how likely one was to
encounter recent vandalism in Wikipedia based on selecting articles at
random and using revert behavior as a proxy for recent vandalism.

http://lists.wikimedia.org/pipermail/foundation-l/2009-August/054171.html

One of the key limitations of that work was that it was looking at
articles selected at random from the pool of all existing page titles.
 That approach was of the most immediate interest to me, but it didn't
directly address the likelihood of encountering vandalism based on the
way that Wikipedia is actually used because the selection of articles
that people choose to visit is highly non-random.

I've now redone that analysis with a crude traffic-based weighting.
For traffic information I used the same data stream used by
http://stats.grok.se.  That data is recorded hourly.  For simplicity I
chose 20 hours at random from the last eight months and averaged those
together to get a rough picture of the relative prominence of pages.
I then chose a selection of 30000 articles at random with their
probability of selection proportional to the traffic they received,
and repeated the analysis previously described.  (Note that this
has the effect of treating the prominence of each page as a constant
over time.  In practice we know some pages rise to prominence while
others fall, but I am assuming the average pattern is still a good
enough approximation to be useful.)
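
For anyone wanting to try something similar, the weighting step above
can be sketched roughly as follows.  This is only an illustration, not
the script actually used: the function names and the input layout (one
dict of page title to view count per sampled hour) are assumptions, as
is drawing with replacement, which the description above does not
specify.

```python
import random
from collections import Counter


def average_traffic(hourly_counts):
    """Average per-page view counts over the sampled hours.

    hourly_counts: list of dicts mapping page title -> views in one hour
    (hypothetical layout). A page missing from an hour counts as zero
    views for that hour.
    """
    totals = Counter()
    for hour in hourly_counts:
        totals.update(hour)  # adds this hour's counts to the running totals
    n = len(hourly_counts)
    return {title: views / n for title, views in totals.items()}


def sample_by_traffic(avg_views, k, seed=None):
    """Draw k titles with probability proportional to average traffic.

    Assumption: draws are made with replacement, so a very popular page
    can appear more than once in the sample.
    """
    rng = random.Random(seed)
    titles = list(avg_views)
    weights = [avg_views[t] for t in titles]
    return rng.choices(titles, weights=weights, k=k)
```

With 20 sampled hours loaded into hourly_counts, a call such as
sample_by_traffic(average_traffic(hourly_counts), 30000) would give a
traffic-weighted selection.  Because the weights are fixed averages,
this treats each page's prominence as constant over time, as noted
above.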


