On 8/21/05, Jakob Voss <jakob.voss(a)nichtich.de> wrote:
> Cormac Lawler wrote:
>> Just something that occurs to me as I write up my dissertation - I
>> keep on thinking it would be nice to be able to cite some basic
>> figures to back up a point I am making, e.g. how many times Wikipedia
>> is edited on a given day or how many pages link to this policy page -
>> as I asked in an email to the wikipedia-l list, which has mysteriously
>> vanished from the archives (August 11, entitled "What links here?"). I
>> realise these could be done by going to the recent changes or special
>> pages and counting them all, but I'm basically too lazy to do that.
> I've been doing various statistics on Wikipedia data for months. Not
> all data is available, but there is *a lot* - much more to analyse
> than I have time for. You can answer a lot of questions with the
> database dumps (recently changed to XML) and the Python MediaWiki
> framework, but that means you have to dig into the data models and
> programming.
I'm certainly not averse to doing some work ;) and I'd be happy to
look into this as long as there are some clear instructions for doing
it. That's primarily what I'm interested in.
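
To make concrete what I mean by "clear instructions": even a short,
commented recipe like the one below would do. This is only a sketch I
pieced together from the export format pages - the element names are
my guess and I haven't run it against a real dump - but it shows the
level I'm after: count the pages and revisions in an XML dump.

    import sys
    import xml.etree.ElementTree as ET

    def localname(tag):
        # Dump elements are namespaced, e.g. '{http://...}page';
        # keep only the part after the closing brace.
        return tag.rsplit('}', 1)[-1]

    pages = 0
    revisions = 0
    # iterparse() streams the file, so a multi-gigabyte dump never
    # has to fit in memory at once.
    for event, elem in ET.iterparse(sys.argv[1], events=('end',)):
        name = localname(elem.tag)
        if name == 'revision':
            revisions += 1
        elif name == 'page':
            pages += 1
            elem.clear()   # free the finished <page> subtree

    print('%d pages, %d revisions' % (pages, revisions))
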
>> We're talking about thousands of pages here, right? I'm also thinking
>> this is something that many people would be interested in finding out
>> and writing about. So what I'm asking is: to help researchers
>> generally, wouldn't it be an idea to identify some quick database
>> hacks that we could provide - almost like a Kate's Tools function?
>> Or are these available on the MediaWiki pages?
> The only solution is to share your code and data and to publish
> results frequently. That's how research works, isn't it? I'm very
> interested in having a dedicated server for Wikimetrics, but someone
> has to admin it (getting the hardware is not such a problem). For
> instance, I could parse the version history dump to select only
> article, user and timestamp, so other people could analyse which
> articles are edited on which days, or vice versa - but I just don't
> have a server to handle gigabytes of data. Up to now I have only
> managed to set up a Data Warehouse for Personendaten
> (http://wdw.sieheauch.de/), but - like most of what's already been
> done - it is mostly undocumented :-(
It'd be very interesting to see details of your data and methodology -
I'm sure that's something that will be of incredible value as we move
research on Wikipedia forward. But not just as in a paper, where you
would normally say "I retrieved this data from an SQL dump of the
database" and then go on to do things with the data; what I am looking
for, to repeat, is *how you actually do this*, from another
researcher's point of view.
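
To illustrate: the extraction you describe - article, user and
timestamp from the version history dump - might be written up for
other researchers as something like this. Again only a sketch (the
element names are my assumption from the export format, and I haven't
tested it), but it is the kind of step-by-step recipe I mean:

    import sys
    import xml.etree.ElementTree as ET

    def localname(tag):
        # Strip the XML namespace from a tag name.
        return tag.rsplit('}', 1)[-1]

    title = None
    for event, elem in ET.iterparse(sys.argv[1], events=('end',)):
        name = localname(elem.tag)
        if name == 'title':
            title = elem.text
        elif name == 'revision':
            # A <revision> contains a <timestamp> and a <contributor>
            # holding either a <username> or an <ip>.
            user = stamp = None
            for child in elem.iter():
                cname = localname(child.tag)
                if cname == 'timestamp':
                    stamp = child.text
                elif cname in ('username', 'ip'):
                    user = child.text
            print('%s\t%s\t%s' % (title, user, stamp))
            elem.clear()
        elif name == 'page':
            elem.clear()

Run over a dump, that would give exactly the tab-separated
article/user/timestamp file you describe, ready for other people to
analyse.
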
>> If they are - and I've looked at some database-related pages - they're
>> certainly not understandable from the perspective of someone who just
>> wants to use basic functions. You might be thinking of sending me to a
>> page like http://meta.wikimedia.org/wiki/Links_table - but *what does
>> it mean?* Can someone either help me out, or suggest what we could do
>> about this in the future?
> 1.) collect the questions; define exactly what you want (for instance
> "number of articles edited on each day")
> 2.) collect ways to answer them ("extract data X from Y and calculate Z")
> 3.) find someone who does it
> Well, it sounds like work ;-)
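
Taking your example, "number of articles edited on each day": if step
2's output were the tab-separated article/user/timestamp file from your
sketch, step 3 could be a few lines of Python - again just a sketch,
assuming timestamps in the dump's YYYY-MM-DDTHH:MM:SSZ form:

    import sys

    articles_by_day = {}
    for line in open(sys.argv[1]):
        # Each line: article <tab> user <tab> timestamp.
        article, user, stamp = line.rstrip('\n').split('\t')
        day = stamp[:10]   # keep the YYYY-MM-DD part
        # Use a set so each article is counted once per day.
        articles_by_day.setdefault(day, set()).add(article)

    for day in sorted(articles_by_day):
        print('%s\t%d' % (day, len(articles_by_day[day])))

If recipes like these lived on Meta right next to the questions they
answer, a newcomer could adapt them instead of reverse-engineering the
data model every time.
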
1, 2 and 3 should either be written up on m:Wikimedia Research Network
or a subpage of m:Research. As for the ongoing work in this area, I'll
be taking a quantitative research module as part of my latest masters,
and I'll happily intertwine any project we deem fitting/necessary with
my project for that module. I just have to finish off my current
masters first, which means that my wiki workload has to be put on hold
for about two weeks.
> Greetings,
> Jakob
Thanks,
Cormac