[Foundation-l] Frequency of Seeing Bad Versions - now with traffic data

Anthony wikimail at inbox.org
Fri Aug 28 10:55:09 UTC 2009


On Fri, Aug 28, 2009 at 12:43 AM, Brion Vibber <brion at wikimedia.org> wrote:

> On 8/27/09 9:39 PM, Thomas Dalton wrote:
> > 2009/8/28 Gregory Maxwell<gmaxwell at gmail.com>:
> >> If the results of this kind of study have good agreement with
> >> mechanical proxy metrics (such as machine detected vandalism) our
> >> confidence in those proxies will increase, if they disagree it will
> >> provide an opportunity to improve the proxies.
> >
> > This kind of intensive study on a few small samples with a more
> > automated method used on the same sample to compare would be more
> > achievable. If the automated method gets similar results, we can use
> > that method for larger samples.
>
> I would certainly be interested in seeing such a result.


Can you get us 5000 random article views from the http log made during the
first half of 2009?  All we need is URL/date/time.  Everything else can be
blanked for anonymizing.  It can be from a 1/10th log or whatever.  The list
should consist solely of *views*, not edits, and only of articles.
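
To make the request concrete, here is a rough sketch of how such a list could
be drawn in one pass over a log.  Everything about the input is an assumption:
I'm pretending each line is "timestamp url" separated by whitespace, the
namespace prefix list is a hypothetical and incomplete heuristic, and
sample_article_views / is_article_view are just illustrative names.  It uses
reservoir sampling, so the log never has to fit in memory.

```python
import random
from urllib.parse import urlsplit, unquote

# Assumed, non-exhaustive list of non-article namespace prefixes.
NON_ARTICLE_PREFIXES = (
    "Talk:", "User:", "User_talk:", "Wikipedia:", "Wikipedia_talk:",
    "Special:", "File:", "Image:", "Template:", "Category:", "Help:",
    "Portal:", "MediaWiki:",
)

def is_article_view(url):
    """Heuristic: a plain /wiki/<Title> request in the main namespace."""
    parts = urlsplit(url)
    if parts.query:
        # Skips index.php?action=edit, ?oldid=..., and similar requests.
        return False
    if not parts.path.startswith("/wiki/"):
        return False
    title = unquote(parts.path[len("/wiki/"):])
    return bool(title) and not title.startswith(NON_ARTICLE_PREFIXES)

def sample_article_views(lines, k=5000, seed=None):
    """Reservoir-sample up to k (timestamp, url) pairs from log lines."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0  # number of qualifying article views seen so far
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed line under the assumed format
        timestamp, url = fields[0], fields[1]
        if not is_article_view(url):
            continue
        seen += 1
        if len(reservoir) < k:
            reservoir.append((timestamp, url))
        else:
            # Keep the new view with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = (timestamp, url)
    return reservoir
```

Only the timestamp and URL are kept, so every other field is dropped by
construction.  The prefix check is crude and would need a proper namespace
list for whichever wiki the log covers.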

All the rest of the data is out there, unless we happen to hit on a
deleted/oversighted revision.  But using http://dammit.lt/wikistats/ to
estimate the hits is less accurate.  Many popular pages get popular
suddenly, and then quickly fade away.  There is most likely a strong
correlation between the amount of vandalism that takes place while they are
popular and the amount that takes place while they are not popular, so I'd
much prefer a sample from the actual http log.

If we can't get the real thing, I'll start downloading from
http://dammit.lt/wikistats/ and generate an estimated one, though.
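
For what it's worth, the estimated version might look something like the
sketch below.  It assumes the hourly counts have already been parsed into
(hour_start, title, count) tuples; that layout and the name
estimate_view_sample are mine, not anything the wikistats files define.  Each
simulated view is drawn with probability proportional to the hourly count and
then given a uniformly random time within that hour, which is exactly where
the accuracy is lost compared to the real log.

```python
import random
from datetime import datetime, timedelta

def estimate_view_sample(hourly_counts, k=5000, seed=None):
    """Draw k simulated (time, title) views weighted by hourly hit counts.

    hourly_counts: iterable of (hour_start: datetime, title: str, count: int).
    """
    rng = random.Random(seed)
    records = [(h, t, c) for h, t, c in hourly_counts if c > 0]
    weights = [c for _, _, c in records]
    # Sampling with replacement: the same page-hour can be picked twice,
    # just as two real views can land on the same page in the same hour.
    picks = rng.choices(records, weights=weights, k=k)
    sample = []
    for hour_start, title, _count in picks:
        # The true second is unknown, so spread views evenly over the hour.
        offset = timedelta(seconds=rng.random() * 3600)
        sample.append((hour_start + offset, title))
    return sample
```

This holds every page-hour record in memory, so for half a year of data it
would have to be done in batches or with running totals instead.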

Once we have the list, anyone is free to examine it any way they want, and
show their results.  But we're talking about probably fewer than 200
instances of vandalism here, so it'll be quite easy (and fun) to lambaste
anyone whose methods produce false positives.
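
As a sketch of what that check could look like: given hand-reviewed labels
and an automated method's flags for the same sampled views, list the
disagreements and compute precision and recall.  The dict-of-booleans layout
and the name compare_to_manual are assumptions for illustration only.

```python
def compare_to_manual(manual, automated):
    """Compare automated vandalism flags against hand-reviewed labels.

    manual, automated: dicts mapping a view id to True (vandalized) or False.
    Only view ids present in both dicts are compared.
    """
    ids = manual.keys() & automated.keys()
    true_pos = sum(1 for i in ids if manual[i] and automated[i])
    false_pos = sorted(i for i in ids if automated[i] and not manual[i])
    false_neg = sorted(i for i in ids if manual[i] and not automated[i])
    flagged = true_pos + len(false_pos)
    actual = true_pos + len(false_neg)
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        # None when the ratio is undefined (nothing flagged / nothing bad).
        "precision": true_pos / flagged if flagged else None,
        "recall": true_pos / actual if actual else None,
    }
```

With only a couple of hundred positives at most, the false positive list is
short enough to go through by hand.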

If you're going to do it, maybe we should work on a rough-consensus
objective definition of "vandalism" before you release the file, though...


