[Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

Anthony wikimail at inbox.org
Thu Aug 20 21:10:44 UTC 2009


On Thu, Aug 20, 2009 at 1:55 PM, Nathan <nawrich at gmail.com> wrote:
>
> My point (which might still be incorrect, of course) was that an analysis
> based on 30,000 randomly selected pages was more informative about the
> English Wikipedia than 100 articles about serving United States Senators.


Any automated method of finding vandalism is doomed to failure.  I'd say its
informativeness was precisely zero.

Greg's analysis, on the other hand, was informative, but it was targeted at
a much different question than Robert's.

"if one chooses a random page from Wikipedia right now, what is the
probability of receiving a vandalized revision?"  The best way to answer that
question would be with a manually processed random sample taken from a
pre-chosen moment in time.  As few as 1000 revisions would probably be
sufficient, if I know anything about statistics, but I'll let someone with
more knowledge of statistics verify or refute that.  The results will depend
heavily on one's definition of "vandalism", though.
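
As a rough check on the "1000 revisions" guess, here is a minimal sketch of a
Wilson score interval for a sampled proportion.  The function name and the
example counts (4 vandalized revisions out of 1000, matching the 0.4% in the
subject line) are assumptions for illustration only, not results from any of
the analyses discussed here.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.

    hits: revisions a human reviewer marked as vandalized (hypothetical)
    n:    total revisions in the manually processed random sample
    z:    normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Illustrative numbers only: 4 of 1000 sampled revisions judged vandalized.
low, high = wilson_interval(4, 1000)
print(f"{low:.4f} to {high:.4f}")
```

If the true rate is near 0.4%, a sample of 1000 gives an interval that is
about as wide as the estimate itself, so a larger sample may be needed to pin
down a rate that small.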

On Thu, Aug 20, 2009 at 12:38 PM, Jimmy Wales <jwales at wikia-inc.com> wrote:
>
> Is there a possibility of re-running the numbers to include traffic
> weightings?
>

That definitely should be done.


> I would hypothesize from experience that if we adjust the "random page"
> selection to account for traffic (to get a better view of what people
> are actually seeing) we would see slightly different results.
>

I think we'd see drastically different results.
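
A minimal sketch of what traffic weighting could look like, assuming a
hypothetical mapping from page title to view count and a hypothetical set of
pages currently showing a vandalized revision.  Neither the names nor the data
come from Robert's or Greg's analysis.

```python
import random

def sample_pages_by_traffic(page_views, k, seed=None):
    """Draw k page titles with probability proportional to their views.

    page_views: dict of title -> view count (hypothetical input)
    Sampling is with replacement, so popular pages can repeat.
    """
    rng = random.Random(seed)
    titles = list(page_views)
    weights = [page_views[t] for t in titles]
    return rng.choices(titles, weights=weights, k=k)

def traffic_weighted_rate(page_views, vandalized):
    """Share of all page views that land on a vandalized page."""
    total = sum(page_views.values())
    if total == 0:
        raise ValueError("no traffic recorded")
    bad = sum(v for title, v in page_views.items() if title in vandalized)
    return bad / total

# Made-up example: one busy clean page, one obscure vandalized page.
views = {"Busy page": 900, "Obscure page": 100}
print(traffic_weighted_rate(views, {"Obscure page"}))
```

In this made-up example half of the pages are vandalized, but only a tenth of
the views reach a vandalized page, which is the kind of gap the two ways of
counting can produce.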


> I think we would see a lot less (percentagewise) vandalism that persists
> for a really long time for precisely the reason you identified: most
> vandalism that lasts a long time, lasts a long time because it is on
> obscure pages that no one is visiting.


Agreed.  On the other hand, I think we'd also see that pages with more
traffic are more likely to be vandalized.

Of course, this assumes a valid methodology.  Using "admin rollback, the undo
function, the revert bots, various editing tools, and commonly used phrases
like 'rv', 'rvv', etc." to find vandalism is heavily skewed toward vandalism
that doesn't last very long (or at least doesn't last very many edits).  It's
basically useless.

