On 8/21/05, Jakob Voss <jakob.voss@nichtich.de> wrote:
Cormac Lawler wrote:
Just something that occurs to me as I write up my dissertation: I keep thinking it would be nice to be able to cite some basic figures to back up a point I am making, e.g. how many times Wikipedia is edited on a given day, or how many pages link to this policy page - as I asked in an email to the wikipedia-l list, which has mysteriously vanished from the archives (August 11, entitled "What links here?"). I realise these figures could be gathered by going to the recent changes or special pages and counting entries by hand, but I'm basically too lazy to do that.
I've been doing various statistics on Wikipedia data for months. Not all of the data is available, but there is *a lot* - far more to analyse than I can get through in my time. You can answer a lot of questions with the database dumps (recently changed to XML) and the Python MediaWiki framework, but that means you have to dig into the data models and do some programming.
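To give a concrete idea of the kind of digging involved, here is a rough, untested sketch in plain Python (no framework) that counts revisions per day from a pages-meta-history dump - the element names follow the XML export format, everything else (file names, output layout) is only illustrative:

# edits_per_day.py - sketch: count revisions per calendar day in a
# MediaWiki XML dump. Untested; dump file name is just an example.
import sys
from collections import defaultdict
from xml.etree.ElementTree import iterparse

counts = defaultdict(int)

# iterparse streams the file, so a multi-gigabyte dump never has to fit in memory
for event, elem in iterparse(sys.argv[1], events=("end",)):
    tag = elem.tag.rsplit("}", 1)[-1]      # drop the XML namespace prefix, if any
    if tag == "timestamp":
        counts[elem.text[:10]] += 1        # "2005-08-21T12:34:56Z" -> "2005-08-21"
    elif tag == "page":
        elem.clear()                       # free the finished <page> subtree

for day in sorted(counts):
    print("%s\t%d" % (day, counts[day]))

Run it as "python edits_per_day.py pages-meta-history.xml" and expect it to take a while on a full dump.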
I'm certainly not averse to doing some work ;) and I'd be happy to look into this as long as there are some clear instructions for doing it. That's primarily what I'm interested in.
We're talking about thousands of pages here, right? I'm also thinking this is something that many people would be interested in finding out and writing about. So what I'm asking is: to help researchers generally, wouldn't it be an idea to identify some quick database hacks that we could provide - almost like a Kate's tools function? Or are these already available on the MediaWiki pages?
The only solution is to share your code and data and to publish results frequently. That's how research works, isn't it? I'm very interested in having a dedicated server for Wikimetrics, but someone has to admin it (getting the hardware is not such a problem). For instance, I could parse the version history dump to extract only article, user and timestamp, so other people could analyse which articles are edited on which days, or vice versa, but I just don't have a server that can handle gigabytes of data. Up to now I have only managed to set up a data warehouse for Personendaten (http://wdw.sieheauch.de/), but - like most of what has already been done - it is mostly undocumented :-(
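To make the idea concrete, the reduction could look roughly like this (an untested sketch only - the element names follow the XML export format, and the tab-separated output is just one possible layout):

# reduce_history.py - sketch: boil a pages-meta-history XML dump down to one
# "title <TAB> user <TAB> timestamp" line per revision, so the result can be
# analysed with ordinary tools (sort, uniq, a spreadsheet, ...). Untested.
import sys
from xml.etree.ElementTree import iterparse

def localname(tag):
    # strip the "{http://...}" namespace that iterparse puts in front of tags
    return tag.rsplit("}", 1)[-1]

title = None
for event, elem in iterparse(sys.argv[1], events=("end",)):
    tag = localname(elem.tag)
    if tag == "title":
        title = elem.text
    elif tag == "revision":
        user = None
        timestamp = None
        for child in elem.iter():
            name = localname(child.tag)
            if name in ("username", "ip"):   # registered user or anonymous IP
                user = child.text
            elif name == "timestamp":
                timestamp = child.text
        print("%s\t%s\t%s" % (title, user or "", timestamp or ""))
        elem.clear()                          # keep memory flat across gigabytes
    elif tag == "page":
        elem.clear()

The point of doing this reduction once on a capable machine is that the resulting file is small enough for everyone else to sort, grep and aggregate with ordinary tools.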
It'd be very interesting to see details of your data and methodology - I'm sure that's something that will be of incredible value as we move research on Wikipedia forward. But not just as in a paper, where normally you say "I retrieved this data from an SQL dump of the database" and then go on to do things with the data; what I am looking for, to repeat, is *how you actually do this* from another researcher's point of view.
If they are - and I've looked at some database-related pages - they're certainly not very understandable from the perspective of someone who just wants to use basic functions. You might be thinking of sending me to a page like http://meta.wikimedia.org/wiki/Links_table - but *what does it mean?* Can someone either help me out, or suggest what we could do about this in the future?
1.) collect the questions and define exactly what you want (for instance "number of articles edited on each day")
2.) collect ways to answer them ("extract data X from Y and calculate Z", as sketched below)
3.) find someone who does it
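As an example of what step 2 can amount to for the question in step 1 - assuming a reduced title/user/timestamp file like the hypothetical one sketched above - the calculation is only a few lines:

# articles_per_day.py - sketch for "number of articles edited on each day":
# read "title <TAB> user <TAB> timestamp" lines on stdin and count the
# *distinct* titles per calendar day. The input layout is the hypothetical
# one from the earlier sketch; untested.
import sys
from collections import defaultdict

articles = defaultdict(set)
for line in sys.stdin:
    title, user, timestamp = line.rstrip("\n").split("\t")
    articles[timestamp[:10]].add(title)       # "2005-08-21T..." -> "2005-08-21"

for day in sorted(articles):
    print("%s\t%d" % (day, len(articles[day])))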
Well, it sounds like work ;-)
1, 2 and 3 should either be written up on m:Wikimedia Research Network or on a subpage of m:Research. As for ongoing work in this area, I'll be taking a quantitative research module as part of my latest masters, and I'll happily intertwine whatever project we deem fitting/necessary with my project for that module. I just have to finish off my current masters first, which means that my wiki workload has to be put on hold for about two weeks.
Greetings, Jakob
Thanks, Cormac