[Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

Gregory Maxwell gmaxwell at gmail.com
Thu Aug 20 17:18:51 UTC 2009


On Thu, Aug 20, 2009 at 12:46 PM, Jimmy Wales<jwales at wikia-inc.com> wrote:
[snip]
> Greg, I think your email sounded a little negative at the start, but not
> so much further down.  I think you would join me heartily in being super
> grateful for people doing this kind of analysis.  Yes, some of it will
> be primitive and will suffer from the many difficulties.  But
> data-driven decisionmaking is a great thing, particularly when we are
> cognizant of the limitations of the data we're using.
>
> I just didn't want anyone to get the idea (and I'm sure I'm reading you
> right) that you were opposed to people doing research. :-)


Absolutely. No one who has done this kind of analysis could fail to
appreciate the enormous amount of work that goes into making even a
couple of simple, seemingly "off the cuff" numbers out of the mountain
of data that is Wikipedia.

Making sure the numbers are accurate and meaningful while also clearly
explaining the process of generating them is in and of itself a large
amount of work, and my gratitude is extended to anyone who contributes
to those processes.

I've long been a loud proponent of data-driven decision making. So I'm
absolutely not opposed to people doing research, but just as you said,
we need to be acutely aware of the limitations of the research.  Weak
data is clearly better than no data, but only when you are aware of
the strength of the data.  Or, in other words, knowing what you don't
know is often *the most critical* piece of information in any
decision-making process.

In our eagerness to establish what we can and do know, it can be easy
to forget how much we don't know. Some of the limitations which are
all too obvious to researchers are less than obvious to people who've
never personally done quantitative analysis on Wikipedia data, yet
many of these people are the decision makers who must do something
useful with the data. The casual language used when researchers write
for researchers can magnify misunderstandings.  My intent was merely
to caution against the related risks.

I think the most impactful contributions available for researchers
today are less in the area of the direct research itself but are
instead in advancing the art of researching Wikipedia.  But the two go
hand in hand: we can't advance the art if we don't do the research.
Advancing the art is less sexy and not prone to generating headlines,
but it is work that will last and generate citations for a long time.
Measurements of X today will soon be forgotten as they are replaced by
later analysis of the historical data using superior techniques.

That my tone was somewhat negative is only due to my extreme
disappointment that our own discussion of recent measurements has
been almost entirely devoid of critical analysis. Contributors patting
themselves on the back and saying "I told you so!" seem to be
outnumbering suggestions that the research might mean something else
entirely, though perhaps that is my own bias speaking.  To the extent
that I'm wrong about that, I hope that my comments were merely
redundant; to the extent that I'm right, I hope my points will invite
nuanced understanding of the research and encourage people to seek out
and expose potentially confounding variables and bad proxies so that
all our knowledge can be advanced.

If this stuff were easy it would all be done already. Wikipedia
research is interesting because it is both hard and potentially
meaningful. There is room and need for contributions from everyone.

Cheers!


