On Thu, Aug 20, 2009 at 7:54 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/8/21 Anthony wikimail@inbox.org:
"Is this article vandalized?" is a yes/no question...
True, but that isn't actually the question that this research tried to answer. It tried to answer "How much time has this article spent in a vandalised state?".
"When one downloads a dump file, what percentage of the pages are actually in a vandalized state?"
"This is equivalent to asking, if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision?"
That's the question I was referring to.
If we are only interested in whether the most recent revision is vandalised, then that is a simpler problem, but it would require a much larger sample to get the same quality of data.
How much larger? Do you know anything about this, or are you just guessing? The number of random samples needed for a high degree of confidence tends to be much, much less than most people suspect. That much I know.
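For a yes/no proportion, the standard approximation is n ≈ z^2 * p * (1-p) / e^2, where p is the expected proportion and e is the absolute margin of error; the population size barely matters once it's large. At the worst case of p = 50%, with e = 5% and 95% confidence (z = 1.96), that works out to about 385 samples, whether the wiki has twenty thousand pages or three million.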
I found one problem with my use of http://www.raosoft.com/samplesize.html
I was specifying a margin of error of 5%. But that's an absolute margin of error, so if the true rate were 0.2% vandalism, that'd be 0.2% plus or minus 5 percentage points. Obviously unacceptable.
However, the response distribution would then be 0.2%. That would still require 7649 samples for 95% confidence plus or minus 0.1%. If vandalism turned out to be more prevalent, though, and I suspect it would, we could for instance be 95% confident plus or minus 0.5% with a response distribution of 0.5% and only 765 samples.
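Here's a quick Python sketch of what I believe Raosoft's calculator is computing: the standard sample-size formula with a finite population correction. Assuming z = 1.96 and a population of about 3,000,000 articles (roughly the English Wikipedia article count at the time; Raosoft's exact internals may differ slightly), it reproduces the figures above:

import math

def sample_size(population, response_pct, margin_pct, z=1.96):
    # Raosoft-style sample size with finite population correction.
    # response_pct and margin_pct are percentages (0.2 means 0.2%),
    # and the margin is absolute, not relative.
    x = z * z * response_pct * (100.0 - response_pct)
    # For very large populations this tends toward z^2 * p(1-p) / e^2.
    return math.ceil(population * x / ((population - 1) * margin_pct ** 2 + x))

# Assuming ~3,000,000 articles and z = 1.96 (both guesses on my part):
print(sample_size(3000000, 0.2, 0.1))  # -> 7649
print(sample_size(3000000, 0.5, 0.5))  # -> 765

Either way, the broad point holds: it's the tiny absolute margin, not the size of the wiki, that pushes the sample into the thousands.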