[Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

Robert Rohde rarohde at gmail.com
Thu Aug 20 22:36:45 UTC 2009


On Thu, Aug 20, 2009 at 2:10 PM, Anthony<wikimail at inbox.org> wrote:
> On Thu, Aug 20, 2009 at 1:55 PM, Nathan <nawrich at gmail.com> wrote:
>>
>> My point (which might still be incorrect, of course) was that an analysis
>> based on 30,000 randomly selected pages was more informative about the
>> English Wikipedia than 100 articles about serving United States Senators.
>
>
> Any automated method of finding vandalism is doomed to failure.  I'd say its
> informativeness was precisely zero.
>
> Greg's analysis, on the other hand, was informative, but it was targeted at
> a much different question than Robert's.
>
> "if one chooses a random page from Wikipedia right now, what is the
> probability of receiving a vandalized revision"  The best way to answer that
> question would be with a manually processed random sample taken from a
> pre-chosen moment in time.  As few as 1000 revisions would probably be
> sufficient, if I know anything about statistics, but I'll let someone with
> more knowledge of statistics verify or refute that.  The results will depend
> heavily on one's definition of "vandalism", though.

Only in dreadfully obvious cases can you look at a revision by itself
and know it contains vandalism.  If the goal is really to characterize
whether any vandalism has persisted in an article from any time in the
past, then one really needs to look at the full edit history to see
what has been changed / removed over time.

Even at the level of randomly sampling 1000 revisions, doing a real
evaluation of the full history is likely to be impractical for any
manual process.
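
(As an aside on the sample-size question above: under a simple
binomial assumption, 1000 manually judged revisions would pin down a
rate on the order of the 0.4% in the subject line only coarsely.  A
quick back-of-the-envelope sketch, purely illustrative and not part of
the analysis I actually ran:

# 95% margin of error for an estimated proportion, assuming each
# sampled revision is independently vandalized with probability p.
import math

def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

print(margin_of_error(0.004, 1000))  # ~0.0039, i.e. +/- 0.4 points

A margin of about +/- 0.4 percentage points around a 0.4% estimate is
wide in relative terms, so 1000 revisions bounds the rate at "well
under a percent or two" rather than measuring it precisely; for rates
that small an exact or Wilson interval would also be more appropriate
than the normal approximation.)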

If however you restrict yourself to asking whether 1000 edits
contributed vandalism, then you have a relatively manageable task, and
one that is more closely analogous to the technical program I set up.
If it helps, one can think of what I did as trying to characterize
reverts and detect the persistence of "new vandalism" rather than
"vandalism" in general.  And of course, only "new vandalism" could be
fixed by an immediate rollback / revert anyway.
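
To make that concrete, the signals involved are things like rollback,
undo, the revert bots, and summary phrases such as "rv" or "rvv" (as
quoted further down).  Something in the spirit of the following
sketch, where the function names and summary patterns are illustrative
rather than the actual code I used:

# Heuristically flag revert-like edits from their edit summaries.
# Names and patterns here are placeholders, not the real program.
import re

REVERT_SUMMARY = re.compile(
    r"\b(rv|rvv|revert(ed|ing)?|undid revision|rollback)\b",
    re.IGNORECASE)

def looks_like_revert(edit_summary):
    """Return True if an edit summary suggests a revert."""
    return bool(REVERT_SUMMARY.search(edit_summary or ""))

# When revision text hashes are available, an even stronger signal is
# an edit that restores the page to a byte-identical earlier revision:
def is_identity_revert(new_sha1, earlier_sha1s):
    return new_sha1 in earlier_sha1s

An edit flagged this way marks the end of a vandalism episode; the
revision(s) it undoes are the "new vandalism" whose persistence one
can then measure.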

Qualitatively I tend to think that vandalism that has persisted
through many intervening revisions is in a rather different category
than new vandalism.  Since people rarely look at or are aware of an
article's ancient past, such persistent vandalism is at that point
little different than any other error in an article.  It is something
to be fixed, but you won't usually be able to recognize it as a
malicious act.

> On Thu, Aug 20, 2009 at 12:38 PM, Jimmy Wales <jwales at wikia-inc.com> wrote:
>>
>> Is there a possibility of re-running the numbers to include traffic
>> weightings?
>>
>
> definitely should be done

Does anyone have a nice comprehensive set of page traffic aggregated
at, say, a monthly level?  The raw data used by stats.grok.se, etc. is
binned hourly, which opens one up to issues of short-term
fluctuations, but I'm not at all interested in downloading 35 GB of
hourly files just to construct my own long-term averages.
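
(If no such aggregate exists, the roll-up itself is trivial; the
obstacle is purely the 35 GB download.  Assuming the hourly files keep
the usual one-line-per-page "project title count bytes" format, a
rough sketch with placeholder paths would be:

# Sum gzipped hourly pagecount files into per-page long-term totals.
# File locations and the project prefix are placeholders.
import gzip
from collections import Counter

def monthly_totals(hourly_files, project="en"):
    totals = Counter()
    for path in hourly_files:
        with gzip.open(path, "rt", encoding="utf-8",
                       errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) >= 3 and parts[0] == project:
                    try:
                        totals[parts[1]] += int(parts[2])
                    except ValueError:
                        pass
    return totals

The same loop run over a month's worth of files gives the long-term
averages I'd want, but I'd still rather someone point me at an
existing aggregate.)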

>> I would hypothesize from experience that if we adjust the "random page"
>> selection to account for traffic (to get a better view of what people
>> are actually seeing) we would see slightly different results.
>>
>
> I think we'd see drastically different results.

If I had to make a prediction, I'd expect one might see numerically
higher rates of vandalism and shorter average durations, but otherwise
qualitatively similar results given the same methodology.  I agree
though that it would be worth doing the experiment.
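
The experiment itself would mostly be a change to the sampling step:
draw article titles with probability proportional to their view
counts instead of uniformly, then run the same revision analysis on
the result.  A sketch, where page_views is assumed to be a
title-to-count mapping like the totals above:

# Traffic-weighted "random page" draw: titles are sampled (with
# replacement) with probability proportional to their view counts.
# page_views and the sample size are illustrative inputs.
import random

def weighted_random_pages(page_views, k=30000, seed=None):
    rng = random.Random(seed)
    titles = list(page_views)
    weights = [page_views[t] for t in titles]
    return rng.choices(titles, weights=weights, k=k)

Sampling with replacement is the simple version; popular pages can be
drawn more than once, which is arguably the point when the question
is what readers actually see.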

>> I think we would see a lot less (percentagewise) vandalism that persists
>> for a really long time for precisely the reason you identified: most
>> vandalism that lasts a long time, lasts a long time because it is on
>> obscure pages that no one is visiting.
>
> Agreed.  On the other hand, I think we'd also see that pages with more
> traffic are more likely to be vandalized.
>
> Of course, this assumes a valid methodology.  Using "admin rollback, the
> undo
> function, the revert bots, various editing tools, and commonly used
> phrases like "rv", "rvv", etc." to find vandalism is heavily skewed toward
> vandalism that doesn't last very long (or at least doesn't last very many
> edits).  It's basically useless.

Yes, as I acknowledged above, "new vandalism".  My personal interest
is also skewed in that direction.  If you don't like it and don't find
it useful, feel free to ignore me and/or do your own analysis.
Vandalism that has persisted through many revisions is a qualitatively
different critter than most new vandalism.  It's usually hard to
identify, even by a manual process, and is unlikely to be fixed except
through the normal editorial process of review, fact-checking, and
revision.  When vandalism is "new" people are at least paying
attention to it in particular, and all vandalism starts out that way.
Perhaps it would be more useful if you think of this work as a
characterization of revert statistics?

Anyway, I provided my data point and described what I did so others
could judge it for themselves.  Regardless of your opinion, it
addressed an issue of interest to me, and I would hope others also
find some useful insight in it.

-Robert Rohde


