[Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
Gregory Maxwell
gmaxwell at gmail.com
Thu Aug 20 16:34:25 UTC 2009
On Thu, Aug 20, 2009 at 6:06 AM, Robert Rohde <rarohde at gmail.com> wrote:
[snip]
> When one downloads a dump file, what percentage of the pages are
> actually in a vandalized state?
Although you don't actually answer that question, you answer a
different question:
[snip]
> approximations: I considered that "vandalism" is that thing which
> gets reverted, and that "reverts" are those edits tagged with "revert,
> rv, undo, undid, etc." in the edit summary line. Obviously, not all
> vandalism is cleanly reverted, and not all reverts are cleanly tagged.
Which is interesting too, but part of the problem with calling this a
measure of vandalism is that it isn't really one, and we don't have a
good handle on how solid an approximation it is beyond gut feelings
and arm-waving.
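For concreteness, the kind of approximation in question amounts to
something like the following sketch (the regular expression and the
helper here are my own guesses at the approach, not a description of
your actual tooling):

import re

# Sketch of the edit-summary heuristic: an edit counts as a "revert" if its
# summary matches common revert markers, and whatever it undoes is then
# treated as vandalism for the purposes of the approximation.
REVERT_RE = re.compile(r'\b(revert(ed)?|rv|rvv|undo|undid)\b', re.IGNORECASE)

def looks_like_revert(edit_summary):
    """Return True if the edit summary is tagged like a revert."""
    return bool(REVERT_RE.search(edit_summary or ""))

Every caveat below applies to exactly that sort of string matching.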
The study of Wikipedia activity is a new area of research, not
something that has been studied for decades. Not only do we not know
many things about Wikipedia, but we don't know many things about how
to know things about Wikipedia.
There must be ways to get a better understanding, but we may not know
of them, and the ones we do know of are not always used. For example,
we could increase our confidence in this type of proxy-measure by
taking a random subset of that data and having humans classify it
based on some agreed-upon criteria. By performing the review
process many times we could get a handle on the typical error of both
the proxy-metric and the meta-review.
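As a concrete (and entirely made-up) sketch of the sort of check I
mean, where human_label stands in for a human reviewer and the sample
size is arbitrary:

import random

def estimate_proxy_precision(flagged_edits, human_label, sample_size=200, seed=0):
    """Hand-check a random sample of edits the proxy flagged as vandalism
    reverts and report how often a human reviewer agrees."""
    random.seed(seed)
    sample = random.sample(flagged_edits, min(sample_size, len(flagged_edits)))
    agreed = sum(1 for edit in sample if human_label(edit))
    return agreed / len(sample)

Run it against several independent reviewers and you also get a handle
on how much the reviewers themselves disagree.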
The risk here is that people will mistake these shorthand metrics for
the real deal, and that risk is increased when we encourage it by
using language which suggests that the simplistic understanding is the
correct one. IMO, highly uncertain and/or outright wrong information
is worse than not knowing at all when you aren't aware of how reliable
the information is.
We can't control how the press chooses to report on research, but when
we actively encourage misunderstandings by playing up the significance
or generality of our research, our behaviour is unethical. Vigilance is
required.
This risk of misinformation is increased many-fold in comparative
analysis, where factors like time are plotted against indicators,
because we often miss confounding variables
(http://en.wikipedia.org/wiki/Confounding).
Stepping away from your review for a moment, since it wasn't
primarily a comparative one, I'd like to make some general
points:
For example, if research finds that edits are more frequently reverted
over time, is this because there has been a change in the revision
decision process, or have articles become better and more complete
over time, with edits to long, high-quality articles having always
been more likely to be reverted? Both are probably true, but how does
the contribution break down?
There are many other possibly significant confounding variables.
Probably many more than any of us have thought of yet.
I've always been of the school of thought that we do research to
produce understanding, not just to generate numbers. "Wikipedia
becomes more complete over time, leaving less work for new people to
do" and "Wikipedia is increasingly hostile towards new contributors"
are pretty different understandings, but both may be supported by the
same data, at least until you've controlled for many factors.
Another example— because of the scale of Wikipedia we must resort to
proxy-metrics. We can't directly measure vandalism, but we can measure
how often someone adds "is gay" over time. Proxy-metrics are powerful
tools but can be misleading. If we're trying to automatically
identify vandalism for a study (either to include it or to exclude
it), we run the risk that the vandals are adapting to automatic
identification: if you were using "is gay" as a measure of vandalism
over time, you might conclude that vandalism is decreasing when in
reality "ClueBot" is performing the same kind of analysis for its
automatic vandalism suppression and the vandals have responded by
vandalizing in forms that can't be automatically identified, such as
by changing dates to incorrect values.
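To make that concrete, the phrase-counting proxy is essentially the
following sketch (the tuple layout and the choice of phrase are just
illustrative assumptions):

from collections import Counter

def phrase_additions_by_month(revisions, phrase="is gay"):
    """Count, per month, revisions that newly introduce `phrase`.
    `revisions` is assumed to yield (timestamp, old_text, new_text)
    tuples pulled from a dump."""
    counts = Counter()
    for timestamp, old_text, new_text in revisions:
        if phrase in new_text and phrase not in old_text:
            counts[timestamp.strftime("%Y-%m")] += 1
    return counts

Nothing in it can see the vandalism that both it and the bots miss,
which is the whole problem.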
Or, keeping the goal of understanding in mind, sometimes the
measurements can all be right but a lack of care and consideration can
still cause people to draw the wrong conclusions. For example,
English Wikipedia has adopted a much stronger policy about citations
in articles about living people than it once had. It is
*intentionally* more difficult to contribute to those articles than
it once was, especially for new contributors who do not know the
rules.
Going back to your simple study now: The analysis of vandalism
duration and its impact on readers makes an assumption about
readership which we know to be invalid. You're assuming a uniform
distribution of readership: that readers are just as likely to read
any random article. But we know that actual readership follows a
power-law (long-tail) distribution. Because traffic levels aren't
taken into account, we can't draw conclusions about how much vandalism
readers are actually exposed to.
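A traffic-weighted estimate would look more like this sketch (the
input structures here are assumptions; real per-article pageview data
would be needed to use it):

def expected_exposures(vandalized_hours, views_per_hour):
    """Rough expected number of reader exposures to vandalism, weighting
    each article's time spent in a vandalized state by that article's
    traffic rather than assuming uniform readership.
    vandalized_hours: article -> total hours spent vandalized
    views_per_hour:   article -> mean hourly pageviews"""
    return sum(hours * views_per_hour.get(article, 0.0)
               for article, hours in vandalized_hours.items())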
Interestingly, you've found a power-law distribution in vandalism
lifetime. Is it possible that readership and vandalism lifetime are
correlated, i.e. that more widely read articles tend to get reverted
faster? That doesn't sound unreasonable to me, and if it's true it
means that readers are exposed to far less vandalism than a uniform
model would suggest.
In any case— I don't say any of this to criticize the mechanics of
your work. I'm able to point these things out because you were clear
about what you measured, more so than some other analysis has been
(including my own, at times). But I do think that it's important that
we are careful to not describe our work in ways that will cause laymen
to over-generalize, and that we keep in mind that most readers are
not researchers, and that they desperately want the kind of pat
open-and-shut answers that we won't be able to even begin providing
until the study of Wikipedia is far better understood.
Likewise, users of Wikipedia research should be forewarned that
researchers are apt to use simple words like "vandalism" when they are
really measuring something far more specific, and that surprising
correlations between what is actually being measured and the things it
is being measured against may produce misleading conclusions.
Cheers!