[Foundation-l] Database dumps and statistics

Sage Ross ragesoss+wikipedia at gmail.com
Tue Oct 9 02:18:31 UTC 2007


The Foundation needs to put a higher priority on numerical metrics,
and the database dumps from which they are derived.

We see the consequences of neglecting regular enwiki dumps in the
academic studies that come out about Wikipedia.  The most recent
study, "<a href="http://www-users.cs.umn.edu/~reid/papers/group282-priedhorsky.pdf">Creating,
Destroying, and Restoring Value in Wikipedia</a>", used year-old data
(the same dump used by the last several studies).  This greatly limits
the relevance of the results, given how much English Wikipedia has
changed in the last year.

Erik Zachte hasn't been able to provide updated statistics for enwiki
since last October.  Understanding the up-to-date structure of the
editing community, especially in conjunction with the increasingly
sophisticated analyses that computer scientists are producing, can be
a great asset for Wikimedians trying to manage the problems of scale
that projects are now facing.  But without good, recent statistics, we
lose the opportunity to take full advantage of such research.

The fact that the study above relied on specially-requested log data
to calculate per-page view rates is even more troubling; article hit
counters are desperately needed.  One benefit of hit counts, among
many, is to entice experts to edit by showing them just how many
people read the (possibly sub-par) articles related to their
expertise.  Demonstrating the readership levels of important political
and public policy topics would also be helpful in grant applications.
Greg Maxwell informs me that he has written a hit counting tool that
(unlike the standard MediaWiki counter) could be enabled without a
detrimental performance hit.  I hope this can be implemented as soon
as possible.

Yours in discourse,
Sage (User:Ragesoss)


