[Foundation-l] Frequency of Seeing Bad Versions - now with traffic data
Anthony
wikimail at inbox.org
Fri Aug 28 01:47:45 UTC 2009
Just took a quick sample of 10 instances of vandalism to [[Ted Stevens]].
Of those 10 instances, either 2 or 4 would not have been found by the
automated tool described: 2 if every edit summary containing the word
"vandalism" is counted as vandalism, and 4 if not. The former would
probably significantly overcount vandalism.
http://en.wikipedia.org/w/index.php?title=Ted_Stevens&diff=173527553&oldid=173381871
(Removed vandalism)
http://en.wikipedia.org/w/index.php?title=Ted_Stevens&diff=180054904&oldid=179982198
(rmv vandalism)
http://en.wikipedia.org/w/index.php?title=Ted_Stevens&diff=168486242&oldid=168438600
no edit summary
http://en.wikipedia.org/w/index.php?title=Ted_Stevens&diff=162332870&oldid=162038733
(yes it is funny, but this doesn't belong here)
On Thu, Aug 27, 2009 at 9:31 PM, Thomas Dalton <thomas.dalton at gmail.com> wrote:
> 2009/8/28 Anthony <wikimail at inbox.org>:
> > I suggested a better approach last time we had this thread: statistical
> > sampling.
>
> This research was based on a sample. What are you talking about?
I'm talking about taking a sample and examining it manually. First, spend a
few weeks coming up with an objective definition of vandalism. Then pick
5,000 random article views from the http log, and publish the URL/date/time.
Then advertise the list all over the place (especially on sites like
Wikipedia Review) asking people to find instances of vandalism in it.
People can use automated means which they then go through by hand to remove
false positives, manual error checking, spot checking, whatever. The number
of confirmed instances of vandalism will grow for a while, and eventually
will start to level off.
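
A rough sketch of the "pick 5,000 random article views" step, assuming the
http log can be read one line per view. It uses reservoir sampling so the
whole log never has to fit in memory. The function name and the seed
parameter are made up for illustration; nothing here reflects how the actual
logs are stored.

```python
import random


def sample_views(log_lines, k=5000, seed=None):
    """Pick k items uniformly at random from an iterable of log lines.

    Uses reservoir sampling (Algorithm R), so the input is read once and
    only k lines are held in memory. If the input has fewer than k lines,
    all of them are returned.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(log_lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Keep the new line with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = line
    return reservoir
```

Publishing the seed along with the URL/date/time list would let anyone
re-derive the same sample from the same log.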
May not be perfect, but it'll provide a lower bound on the amount of
vandalism, at least. Have a statistician tell us what our exact error
bounds are. And then prepare for a second study, improving on everything
(the definition of "vandalism", the number of random article views, the
amount of time to wait) based on what we learned.
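
For the error bounds, one common choice is a Wilson score interval on the
proportion of sampled views that were confirmed as vandalized. This is only
a sketch of the arithmetic, not a substitute for the statistician: the
function name is made up, and the count of 20 in the usage line is a
placeholder, not a measured result. Since confirmed instances only give a
lower bound, the interval describes the confirmed rate, not the true rate.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion hits / n.

    z is the normal quantile for the desired confidence level
    (1.96 corresponds to roughly 95%).
    """
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Placeholder numbers: 20 confirmed instances out of 5,000 sampled views.
low, high = wilson_interval(20, 5000)
print(f"confirmed rate between {low:.4%} and {high:.4%}")
```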