[Foundation-l] Frequency of Seeing Bad Versions - now with traffic data

Anthony wikimail at inbox.org
Fri Aug 28 15:10:30 UTC 2009


On Fri, Aug 28, 2009 at 10:08 AM, Thomas Dalton <thomas.dalton at gmail.com> wrote:

> 2009/8/28 Anthony <wikimail at inbox.org>:
> > If you're going to do it, maybe we should work on a rough-consensus
> > objective definition of "vandalism" before you release the file,
> though...
>
> Don't we have a consensus definition already? Vandalism is bad faith
> editing. You may also want to include test edits since they are
> treated in the same way (just with different warning messages). That
> isn't objective, but it should be close enough. We can argue over a
> few borderline cases.


Well, it relies on information (intent) that we can't determine simply from
the content of the edit (sometimes it's implied if you look at the entire
behavior of the user, but that's too messy).  Is a POV edit "vandalism"?  I
think it has to be treated as such, at least some of the time ("Windows is
the worst operating system ever"), but there are certainly edits which are
clearly POV but whose intent is unclear (many people don't know the rules).
We need to remove intent from the definition, and I suppose call it
"degraded articles".  But simply saying that anything POV is vandalism would
potentially include just about any large article.

I suppose we can just list everything that's arguably vandalism and then
categorize it later though.  I expect we'll come up with several different
final numbers, which I guess is okay (the only part that really needs to be
pristinely unbiased is the selection of pageviews), though I do expect some
people will adapt their definition of vandalism to fit the data.

> I support the request for 5000 random pageviews (uniform distribution
> by pageview over the last 6 months) from the logs.


Seems like it could be reused for a lot of different types of studies, so
long as the researcher isn't exposed to the details of the URLs before
coming up with his/her methodology.  And I think analyzing those 5000
pageviews in all sorts of ways would "crowdsource" well.  I'd love to see a
"Nature Study" equivalent, analyzing the more subjective aspects of the
articles in addition to just plain old vandalized/not-vandalized.
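
To make the request concrete: assuming whoever has the raw logs can stream
them with one pageview per line (an assumption on my part; I haven't seen
the log format), a uniform sample of 5000 could be drawn in a single pass
with reservoir sampling, something like this in Python:

import random

def reservoir_sample(lines, k=5000):
    # Keep a uniform random sample of k items from a stream of
    # unknown length (standard reservoir sampling, "Algorithm R").
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            j = random.randint(0, i)  # inclusive; line kept with prob k/(i+1)
            if j < k:
                reservoir[j] = line
    return reservoir

# e.g.:  with open("pageviews.log") as f:
#            sample = reservoir_sample(f)

The nice thing is you never need to know the total number of pageviews up
front, so it works even if the logs can only be read once as a stream.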

If we can't get the 5000 random pageviews (do the logs even still exist?), I
suppose wikistats will do.  They have pageviews broken down by hour, so the
non-uniformity within a single hour is probably fairly small for the popular
pages most likely to be selected.

The worst part is that it's a whole lot of data to download, and I'm not
sure any shortcuts can be taken without introducing non-uniformity.  I
considered just downloading the projectcounts, selecting date-hours weighted
by their traffic, and then downloading only the date-hour files needed, but
that potentially introduces error if the non-article traffic isn't well
correlated with the article traffic, so I dunno.  It's probably a safe
assumption that they are well correlated, but I'd rather not guess.  Maybe
talk-page traffic is highly correlated with increased vandalism, or with
decreased vandalism.  It's possible, so I'd rather be safe.
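
For what it's worth, the shortcut I described would look roughly like this
in Python (the projectcounts/pagecounts line formats here are from memory,
so treat the parsing as an assumption, and I've left out the actual
downloading and file naming):

import random

def hourly_totals(projectcounts_files, project="en"):
    # Map each hour's projectcounts file to that project's total views.
    # Assumed line format: "en - <total_views> <total_bytes>"
    totals = {}
    for path in projectcounts_files:
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 3 and fields[0] == project:
                    totals[path] = int(fields[2])
                    break
    return totals

def sample_pageviews(totals, pagecounts_path_for, n=5000, project="en"):
    # Stage 1: pick n date-hours, weighted by each hour's total traffic.
    # Stage 2: within each picked hour, pick one page weighted by its
    # per-page count from the matching pagecounts file.
    # Assumed pagecounts line format: "en <title> <views> <bytes>"
    # pagecounts_path_for is a hypothetical helper mapping an hour to
    # its (already downloaded) pagecounts file.
    hours = list(totals)
    weights = [totals[h] for h in hours]
    sample = []
    for hour in random.choices(hours, weights=weights, k=n):
        titles, counts = [], []
        with open(pagecounts_path_for(hour)) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 3 and fields[0] == project:
                    titles.append(fields[1])
                    counts.append(int(fields[2]))
        sample.append((hour, random.choices(titles, weights=counts, k=1)[0]))
    return sample

In practice you'd group the draws by hour so each pagecounts file is only
read once, and the error I mentioned is exactly that stage one weights the
hours by total project traffic (talk pages and all) while what we really
want is to weight them by article traffic.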

