Hi,

Just something that occurs to me as I write up my dissertation - I keep on thinking it would be nice to be able to cite some basic figures to back up a point I am making, e.g. how many times Wikipedia is edited on a given day, or how many pages link to this policy page - as I asked in an email to the wikipedia-l list, which has mysteriously vanished from the archives (August 11, entitled "What links here?"). I realise these could be done by going to the recent changes or special pages and counting them all, but I'm basically too lazy to do that - we're talking about thousands of pages here, right? I'm also thinking this is something that many people would be interested in finding out and writing about.

So what I'm asking is: to help researchers generally, wouldn't it be an idea to identify some quick database hacks that we could provide - almost like a kate's tools function? Or are these available on the MediaWiki pages? If they are - and I've looked at some database-related pages - they're certainly not very understandable from the perspective of someone who just wants to use basic functions. You might be thinking of sending me to a page like http://meta.wikimedia.org/wiki/Links_table - but *what does it mean?* Can someone either help me out, or suggest what we could do about this in the future?
Cheers, Cormac
Cormac Lawler wrote:
Just something that occurs to me as I write up my dissertation - I keep on thinking it would be nice to be able to cite some basic figures to back up a point I am making, e.g. how many times Wikipedia is edited on a given day, or how many pages link to this policy page - as I asked in an email to the wikipedia-l list, which has mysteriously vanished from the archives (August 11, entitled "What links here?"). I realise these could be done by going to the recent changes or special pages and counting them all, but I'm basically too lazy to do that.
I've been doing various statistics on Wikipedia data for months. Not all the data is available, but there is *a lot* - much more to analyse than I can do in my time. You can answer a lot of questions with the database dumps (recently changed to XML) and the Python MediaWiki framework, but that means you have to dig into the data models and do some programming.
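For illustration, here is a minimal sketch of that kind of counting - not anyone's actual code, just one way it could be done with nothing but the Python standard library, assuming you have an uncompressed full-history XML dump on disk (the filename below is made up):

# Count revisions per day by streaming a MediaWiki XML dump.
# Assumes a locally downloaded, uncompressed dump; the filename is hypothetical.
import xml.etree.ElementTree as ET
from collections import defaultdict

DUMP = "enwiki-pages-meta-history.xml"  # hypothetical filename

edits_per_day = defaultdict(int)

for event, elem in ET.iterparse(DUMP):
    # Tags carry the export schema's XML namespace, so match on the local name.
    if elem.tag.endswith("timestamp"):
        edits_per_day[elem.text[:10]] += 1   # timestamps look like 2005-08-21T14:09:00Z
    elif elem.tag.endswith("page"):
        elem.clear()                         # free memory once a page is done

for day in sorted(edits_per_day):
    print(day, edits_per_day[day])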
we're talking about thousands of pages here, right? I'm also thinking this is something that many people would be interested in finding out and writing about. So what I'm asking is: to help researchers generally, wouldn't it be an idea to identify some quick database hacks that we could provide - almost like a kate's tools function? Or are these available on the MediaWiki pages?
The only solution is to share your code and data and to publish results frequently. That's how research works, isn't it? I'd be very interested in having a dedicated server for Wikimetrics, but someone has to admin it (getting the hardware is not such a problem). For instance, I could parse the version history dump to select only article, user and timestamp, so other people can analyse which articles are edited on which days, or vice versa - but I just don't have a server that can handle gigabytes of data. Up to now I have only managed to set up a data warehouse for Personendaten (http://wdw.sieheauch.de/) but - like most of what's already done - it's mostly undocumented :-(
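To make that (article, user, timestamp) reduction concrete, a rough sketch along the same lines - again only the standard library, with made-up file names, and not claiming to be anyone's actual pipeline - could write one tab-separated row per revision:

# Reduce a full-history XML dump to (title, user, timestamp) rows, one per
# revision, so the result can be analysed with ordinary tools.
# File names are hypothetical.
import csv
import xml.etree.ElementTree as ET

def local(tag):
    # Strip the export schema's XML namespace from a tag name.
    return tag.rsplit("}", 1)[-1]

with open("revisions.tsv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out, delimiter="\t")
    title = user = timestamp = None
    for event, elem in ET.iterparse("enwiki-pages-meta-history.xml"):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name in ("username", "ip"):
            user = elem.text
        elif name == "timestamp":
            timestamp = elem.text
        elif name == "revision":
            writer.writerow([title, user, timestamp])
            elem.clear()
        elif name == "page":
            elem.clear()

From rows like that, "which articles are edited on which days" is just a matter of sorting and counting.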
If they are - and I've looked at some database-related pages - they're certainly not very understandable from the perspective of someone who just wants to use basic functions. You might be thinking of sending me to a page like http://meta.wikimedia.org/wiki/Links_table - but *what does it mean?* Can someone either help me out, or suggest what we could do about this in the future?
1. Collect the questions and define exactly what you want (for instance, "number of articles edited on each day").
2. Collect ways to answer them ("extract data X from Y and calculate Z").
3. Find someone who does it.
Well, it sounds like work ;-)
Greetings, Jakob
On 8/21/05, Jakob Voss jakob.voss@nichtich.de wrote:
Cormac Lawler wrote:
Just something that occurs to me as I write up my dissertation - I keep on thinking it would be nice to be able to cite some basic figures to back up a point I am making, e.g. how many times Wikipedia is edited on a given day, or how many pages link to this policy page - as I asked in an email to the wikipedia-l list, which has mysteriously vanished from the archives (August 11, entitled "What links here?"). I realise these could be done by going to the recent changes or special pages and counting them all, but I'm basically too lazy to do that.
I've been doing various statistics on Wikipedia data for months. Not all the data is available, but there is *a lot* - much more to analyse than I can do in my time. You can answer a lot of questions with the database dumps (recently changed to XML) and the Python MediaWiki framework, but that means you have to dig into the data models and do some programming.
I'm certainly not averse to doing some work ;) and I'd be happy to look into this as long as there are some clear instructions for doing it. That's primarily what I'm interested in.
we're talking about thousands of pages here, right? I'm also thinking this is something that many people would be interested in finding out and writing about. So what I'm asking is: to help researchers generally, wouldn't it be an idea to identify some quick database hacks that we could provide - almost like a kate's tools function? Or are these available on the MediaWiki pages?
The only solution is to share your code and data and to publish results frequently. That's how research works, isn't it? I'd be very interested in having a dedicated server for Wikimetrics, but someone has to admin it (getting the hardware is not such a problem). For instance, I could parse the version history dump to select only article, user and timestamp, so other people can analyse which articles are edited on which days, or vice versa - but I just don't have a server that can handle gigabytes of data. Up to now I have only managed to set up a data warehouse for Personendaten (http://wdw.sieheauch.de/) but - like most of what's already done - it's mostly undocumented :-(
It'd be very interesting to see details of your data and methodology - I'm sure that's something that will be of incredible value as we move research on Wikipedia forward. But not just as in a paper, where you would normally say "I retrieved this data from an SQL dump of the database" and then do things with the data - what I am looking for, to repeat, is *how you actually do this*, from another researcher's point of view.
If they are - and I've looked at some database-related pages - they're certainly not very understandable from the perspective of someone who just wants to use basic functions. You might be thinking of sending me to a page like http://meta.wikimedia.org/wiki/Links_table - but *what does it mean?* Can someone either help me out, or suggest what we could do about this in the future?
1. Collect the questions and define exactly what you want (for instance, "number of articles edited on each day").
2. Collect ways to answer them ("extract data X from Y and calculate Z").
3. Find someone who does it.
Well, it sounds like work ;-)
1, 2 and 3 should be written up either on m:Wikimedia Research Network or on a subpage of m:Research. As for ongoing work in this area, I'll be taking a quantitative research module as part of my next masters, and I'll happily intertwine any project we deem fitting/necessary with my project for that module. I just have to finish off my current masters first, which means that my wiki workload has to be put on hold for about two weeks.
Greetings, Jakob
Thanks, Cormac
On Aug 21, 2005, at 2:09 PM, Cormac Lawler wrote:
It'd be very interesting to see details of your data and methodology -
I agree with Cormac - I'd love to see your code and methodologies. Seeing the Python scripts you use to process the XML could save some of us a lot of time. I started digging into Python again after Wikimania. Anything to shorten the learning curve would be *greatly* appreciated.
Thanks...Kevin
On 8/21/05, Kevin Gamble kevin_gamble@ncsu.edu wrote:
On Aug 21, 2005, at 2:09 PM, Cormac Lawler wrote:
It'd be very interesting to see details of your data and methodology -

I agree with Cormac - I'd love to see your code and methodologies. Seeing the Python scripts you use to process the XML could save some of us a lot of time. I started digging into Python again after Wikimania. Anything to shorten the learning curve would be *greatly* appreciated.
And I'm working on a statistics service similar to what's discussed here as a school project. Nothing to show yet, but I'm glad to hear there's interest. :)
Cormac Lawler wrote:
Jakob wrote:
The only solution is to share your code and data and to publish results frequently. That's how research works, isn't it? I'd be very interested in having a dedicated server for Wikimetrics, but someone has to admin it (getting the hardware is not such a problem). For instance, I could parse the version history dump to select only article, user and timestamp, so other people can analyse which articles are edited on which days, or vice versa - but I just don't have a server that can handle gigabytes of data. Up to now I have only managed to set up a data warehouse for Personendaten (http://wdw.sieheauch.de/) but - like most of what's already done - it's mostly undocumented :-(
It'd be very interesting to see details of your data and methodology - I'm sure that's something that will be of incredible value as we move research on Wikipedia forward. But not just as in a paper, where you would normally say "I retrieved this data from an SQL dump of the database" and then do things with the data - what I am looking for, to repeat, is *how you actually do this*, from another researcher's point of view.
First I had to rewrite http://meta.wikimedia.org/wiki/Help:Export
Actually I parse the XML export with Joost. But this won't help you much at the moment: http://meta.wikimedia.org/wiki/User:Nichtich/Process_MediaWiki_XML_export
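If it helps as a starting point, the Special:Export route needs no dump at all - you can fetch the XML for a single page over HTTP and pick fields out of it. A minimal sketch (the page title is just an example; see Help:Export for the options that control full-history export):

# Fetch the XML export of one page and print its title and the timestamp of
# the current revision. The page title is only an example.
import urllib.request
import xml.etree.ElementTree as ET

def local(tag):
    # Strip the export schema's XML namespace from a tag name.
    return tag.rsplit("}", 1)[-1]

title = "Wikipedia:What_Wikipedia_is_not"   # example page
req = urllib.request.Request(
    "https://en.wikipedia.org/wiki/Special:Export/" + title,
    headers={"User-Agent": "research-example/0.1"})

with urllib.request.urlopen(req) as response:
    tree = ET.parse(response)

for elem in tree.iter():
    if local(elem.tag) in ("title", "timestamp"):
        print(local(elem.tag), elem.text)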
A physical workshop would be much more fruitful I think because it's a lot of work to write HOWTOs :-(
Greetings, Jakob
On 8/24/05, Jakob Voss jakob.voss@nichtich.de wrote:
A physical workshop would be much more fruitful I think because it's a lot of work to write HOWTOs :-(
There's no substitute for face-to-face, but waiting a year would be a shame. I am growing to love screencasts...
Cormac, puzzling why your Aug 11 post is missing.
As for the task of finding what links here, you could do the low-tech hack of just sending a hand-crafted URL that sends back the first 5000 links, like this query that finds out how many folks link to WP:POINT:
http://en.wikipedia.org/w/index.php?title=Special:Whatlinkshere&target=W...
I use 5000 since, the last time I checked, the most any query will return is 5000, for db performance reasons. If there are more than 5000, alter the "offset" number to 1, then rinse, lather and repeat.
At least that way, you're not having to hack XML, SQL or Python.
If you know some shell scripting, you can automate this somewhat: use curl/wget to fetch these pages, then use some combination of grep/wc to find out how many user pages, project pages, talk pages, etc. link to policy pages.
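If shell isn't your thing, here is a rough Python sketch of the same idea - one batch of "what links here" results fetched and tallied by namespace prefix. The target page is only an example, and the target/limit/offset parameters simply mirror the hand-crafted URL above, so double-check them against the current software; for more than 5000 links you would repeat the request with a different offset, as described earlier.

# Fetch one batch of Special:Whatlinkshere results and count the linking
# pages by namespace prefix. Crude HTML scraping - a sketch, not a tool.
import re
import urllib.parse
import urllib.request
from collections import Counter

target = "Wikipedia:What_Wikipedia_is_not"   # example policy page
params = urllib.parse.urlencode(
    {"title": "Special:Whatlinkshere", "target": target,
     "limit": 5000, "offset": 0})
req = urllib.request.Request(
    "https://en.wikipedia.org/w/index.php?" + params,
    headers={"User-Agent": "research-example/0.1"})
html = urllib.request.urlopen(req).read().decode("utf-8")

# Grab page titles from the list items in the result page.
titles = re.findall(r'<li><a href="[^"]*" title="([^"]+)"', html)

counts = Counter(t.split(":", 1)[0] if ":" in t else "(article)" for t in titles)
for prefix, n in counts.most_common():
    print(prefix, n)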
-Andrew (User:Fuzheado)
On 8/21/05, Andrew Lih andrew.lih@gmail.com wrote:
Cormac, puzzling why your Aug 11 post is missing.
Beats the hell out of me.
As for the task of finding what links here, you could do the low-tech hack of just sending a hand-crafted URL that sends back the first 5000 links, like this query that finds out how many folks link to WP:POINT:
http://en.wikipedia.org/w/index.php?title=Special:Whatlinkshere&target=W...
I use 5000 since, the last time I checked, the most any query will return is 5000, for db performance reasons. If there are more than 5000, alter the "offset" number to 1, then rinse, lather and repeat.
Thanks - laborious, but at least it's easy to actually accomplish
At least that way, you're not having to hack XML, SQL or Python.
Aarrrggh
If you know some shell scripting, you can automate this somewhat: use curl/wget to fetch these pages, then use some combination of grep/wc to find out how many user pages, project pages, talk pages, etc. link to policy pages.
-Andrew (User:Fuzheado)
This is all Klingon to me - is there an encyclopedia of that somewhere? :-)
Cormac
On 8/21/05, Cormac Lawler cormaggio@gmail.com wrote:
If you know some shell scripting, you can automate this somewhat: use curl/wget to fetch these pages, then use some combination of grep/wc to find out how many user pages, project pages, talk pages, etc. link to policy pages.
This is all Klingon to me - is there an encyclopedia of that somewhere? :-)
Looking back, it would have been useful to have a "research tools" section at Wikimania, during the days before the formal conference. Jakob V. and Erik Z. touched on some of this by discussing some of the visualization and software tools they used, but it might be useful to have a step-by-step, hands-on session for folks who don't know much about the UNIX command line and text-processing tools. We should definitely make it a part of WM2006.
Unfortunately, my suggestion of a roundtable for research didn't materialize. But we should keep the conversation going. Perhaps a tutorial on Meta on where to start when using Wikipedia data for research.
-Andrew (User:Fuzheado)
On 8/22/05, Andrew Lih andrew.lih@gmail.com wrote:
Looking back, it would have been useful to have a "research tools" section at Wikimania, during the days before the formal conference. Jakob V. and Erik Z. touched on some of this by discussing some of the visualization and software tools they used, but it might be useful to have a step-by-step, hands-on session for folks who don't know much about the UNIX command line and text-processing tools. We should definitely make it a part of WM2006.
Agreed.
Unfortunately, my suggestion of a roundtable for research didn't materialize. But we should keep the conversation going. Perhaps a tutorial on Meta on where to start when using Wikipedia data for research.
-Andrew (User:Fuzheado)
That would be fantastic. I'll draft some questions as per Jakob's request and will happily fill the role of dumb guinea pig to make sure the tutorial is as helpful as possible. But I'll have to hold off on doing this for at least the next two weeks while this dissertation finally takes shape.
Thanks, Cormac