---------- Forwarded message ----------
From: Brian <Brian.Mingus@colorado.edu>
Date: Wed, Jun 2, 2010 at 10:46 PM
Subject: Re: [Wiki-research-l] Quality and pageviews
To: Liam Wyatt <liamwyatt@gmail.com>


Interestingly, the result is negative. The correlation coefficient between quality and pageviews, computed over 2500 featured articles and 2500 random articles, is .18, which is very low. I also trained a linear classifier to predict the quality of an article from its number of page views, and it was no better than chance.

This seems to make sense. People tend to spend a lot of time on articles about their pet topics, and we all know that pet-topic interest is negatively correlated with the historicity of the topic. In many cases that leaves us with better Pokemon articles than articles on important historical figures. This result could probably be fleshed out further by looking at the correlation between the Wikipedia Editorial Team's ratings of quality and importance.

It turns out that this is easy to do visually. Here is their summary table. Many of the most important articles are of low quality, although the converse is not really true. Thus there is some median threshold of importance a topic must get over before someone will consider it a good pet project. That's my take on your question. Cheers.

All rated articles by quality and importance

                                 Importance
Quality           Top      High       Mid       Low       ???     Total
FA                672       982       852       415       232     3,153
FL                101       393       332       401       168     1,395
A                  84       265       262       122        17       750
GA                801     1,900     3,112     2,554       912     9,279
B               7,206    15,246    23,455    17,358    12,252    75,517
C               3,023     8,779    18,292    18,278    13,137    61,509
Start           7,803    37,581   144,561   248,669   169,614   608,228
Stub            2,536    19,020   122,322   638,946   791,594 1,574,418
List            1,284     3,537     8,739    20,067    19,039    52,666
Assessed       23,510    87,703   321,927   946,810 1,006,965 2,386,915
Unassessed        260       988     3,264    15,253   362,711   382,476
Total          23,770    88,691   325,191   962,063 1,369,676 2,769,391
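
To put a number on the visual impression, the Top-importance column of the table can be summed by quality class. The following is a minimal Python sketch that uses only the figures from the table; the dictionary and variable names are illustrative and not part of any Editorial Team tooling.

```python
# Top-importance counts copied from the table (assessed classes only).
top_importance = {
    "FA": 672, "FL": 101, "A": 84, "GA": 801,
    "B": 7206, "C": 3023, "Start": 7803, "Stub": 2536, "List": 1284,
}

assessed_total = sum(top_importance.values())

# Share of Top-importance articles in the two lowest quality classes.
low_quality_share = (top_importance["Start"] + top_importance["Stub"]) / assessed_total

# Share of Top-importance articles rated GA or better (FA, FL, A, GA).
high_quality_share = sum(top_importance[c] for c in ("FA", "FL", "A", "GA")) / assessed_total

print(f"Assessed Top-importance articles: {assessed_total}")
print(f"Start or Stub: {low_quality_share:.1%}")
print(f"GA or better:  {high_quality_share:.1%}")
```

On these figures roughly 44% of the assessed Top-importance articles are Start or Stub class, while about 7% are GA or better.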



On Wed, Jun 2, 2010 at 9:36 AM, Brian <Brian.Mingus@colorado.edu> wrote:
Alright I will run it. It will probably be this evening before it is finished. I'll let you know the results.


On Wed, Jun 2, 2010 at 9:30 AM, Liam Wyatt <liamwyatt@gmail.com> wrote:
I think you are greatly overestimating my technical abilities...
I don't even know where to find python and scipy and even if I did, I wouldn't know what to do with them to make them "go". I'm a historian-wikipedian, not a techie-wikipedian :-)

Nevertheless, I will link out to your script in the blogpost I'm currently writing for those who may wish to run it themselves.
All the best,
-Liam

wittylama.com/blog
Peace, love & metadata


On 2 June 2010 16:24, Brian J Mingus <Brian.Mingus@colorado.edu> wrote:
Hey, all you have to do is install Python and SciPy, save the script to somefilename.py, and then run python somefilename.py.


On Sat, May 29, 2010 at 5:17 PM, Liam Wyatt <liamwyatt@gmail.com> wrote:
Ok - your reasoning is sound. I'm happy that the "FA of the day" won't distort the results :-) Thanks.

On another note however, now is the first time I've been able to look at your link (before I was on my phone) and I must admit I have no idea what to do with it. I'm no coder so I don't know how to go from http://grey.colorado.edu/mingus/index.php/Wikipedia_Quality_Pageviews_Correlation_Coefficient to an actual result...  Can you give me a hint?


-Liam

wittylama.com/blog
Peace, love & metadata


On 29 May 2010 16:50, Brian J Mingus <Brian.Mingus@colorado.edu> wrote:
I think this technique is perfectly scientific. If you sample 2000 featured and 2000 random articles, at most 30 of the 2000 featured articles (1.5%) can have been on the front page last month, since only one article is featured per day. If you like, you can make a list and exclude them. But the effect is very small.


On Sat, May 29, 2010 at 10:24 AM, Brian <Brian.Mingus@colorado.edu> wrote:
I only looked at last month, so it is not sensitive to the article being on the front page except by chance. Looking at Good articles is not a good idea; you will get a very weak correlation.


On Sun, May 30, 2010 at 5:31 AM, Liam Wyatt <liamwyatt@gmail.com> wrote:
Actually Brian, does your tool take account of the unusual spike in pageviews that a featured article receives when it's FA of the day? I'd like to be able to say that the article receives more views even without special events like that. Is it possible to remove that day for each of the FAs from the stats? Or, alternatively, can the script be run looking for Good Articles rather than FAs, as these don't get special publicity but are still better quality than the average?

-Liam 



On 29/05/2010, at 2:48, Brian J Mingus <Brian.Mingus@Colorado.EDU> wrote:



On Fri, May 28, 2010 at 6:51 PM, Brian <Brian.Mingus@colorado.edu> wrote:


On Fri, May 28, 2010 at 5:31 PM, Liam Wyatt <liamwyatt@gmail.com> wrote:
Does anyone know of any research which demonstrates a correlation between the quality of an article in Wikipedia and the number of pageviews it receives?
I'm trying to argue, as part of my work with museums, that if they want to get people to come to their own website from wikipedia, then instead of focusing on adding top-level links to the "external links" section, they should instead focus on adding deep-linked footnotes and encouraging the improvement of the quality of the article in general. This is on the basis that "increased quality=increased pageviews=increased clickthroughs". A win-win situation.

I have some very nice statistics of pageviews (from stats.grok.se) that match the clickthrough stats from a museum (from their analytics) that clearly demonstrate a correlation between increased pageviews and increased clickthroughs. However, I'm still looking for some scientific data to prove the link between improved article quality and an increase in the number of views of that article.

Can anyone help?

-Liam

An easy way to pull this off looks like this:

- Create a list of all the featured articles
- Use http://stats.grok.se to count their pageviews
- Generate a random list of articles
- Check their pageviews
- Compute correlation between pageviews / quality
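
The last step can be sketched in a few lines of Python. This is a minimal illustration, not the actual script: it assumes quality is coded as 1 for featured and 0 for random articles, the pageview counts are made up, and the helper name pearson is hypothetical. With SciPy installed, scipy.stats.pearsonr should give the same coefficient.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up monthly pageview counts, for illustration only.
featured_views = [5200, 830, 12000, 410, 2900]
random_views = [300, 4500, 95, 1200, 60]

# Assumption: quality is coded as 1 for featured and 0 for random articles.
quality = [1] * len(featured_views) + [0] * len(random_views)
views = featured_views + random_views

r = pearson(quality, views)
print(f"correlation between quality and pageviews: {r:.2f}")
```

In a real run the two lists would hold the counts fetched from stats.grok.se for each sampled article.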

You can also do a slightly more detailed analysis by using the Wikipedia Editorial Team's ratings, but caveat emptor, their rating system is not satisfactory, as I've shown in a couple of papers. Nonetheless, if you group together Featured/A/B articles and Start/Stub articles into two groups, the result should be solid.

  • DeHoust, C., Mangalath, P., Mingus, B. (2008). Improving search in Wikipedia through quality and concept discovery. Technical Report.
  • Rassbach, L., Mingus, B., Blackford, T. (2007). Exploring the feasibility of automatically rating online article quality. Technical Report.
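
The two-group split could look something like the sketch below. The set names and the quality_label helper are hypothetical, "Featured" is taken to mean the FA class only, and the sample data is invented. Classes outside the two groups are simply dropped from the comparison.

```python
# Hypothetical grouping of Editorial Team classes into two quality levels.
HIGH = {"FA", "A", "B"}
LOW = {"Start", "Stub"}

def quality_label(rating):
    """Return 1 for the high group, 0 for the low group, None otherwise."""
    if rating in HIGH:
        return 1
    if rating in LOW:
        return 0
    return None  # e.g. GA, C, List, or unassessed: left out of the comparison

# Made-up (rating, monthly pageviews) pairs, for illustration only.
articles = [("FA", 5200), ("Stub", 60), ("B", 900), ("C", 400), ("Start", 150)]

# Keep only articles that fall into one of the two groups.
labelled = [(quality_label(r), v) for r, v in articles if quality_label(r) is not None]
print(labelled)
```

The resulting labels and pageview counts can then be fed into the same correlation computation used for the featured-versus-random comparison.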

Liam, 

I've written you a quick bot that computes the correlation coefficient between quality and pageviews for random samples of featured and random articles in April. It requires Python and SciPy. You can find it here:


I suggest that you cross validate the result. In other words, run the script several times and then average the results. It's rather slow, owing to the slowness of Wikimedia properties. I personally didn't run it for more than 10 featured and 10 random articles, but I set those numbers to 2000 for you. It will take hours to run.
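
Averaging over several runs could be done along these lines. This is only a sketch: average_over_runs and fake_run are hypothetical names, and fake_run returns a made-up coefficient in place of a full run of the script, so that the example executes without network access.

```python
import random
import statistics

def average_over_runs(run_once, n_runs, seed=0):
    """Call run_once(rng) n_runs times; return the mean and standard deviation."""
    rng = random.Random(seed)
    results = [run_once(rng) for _ in range(n_runs)]
    spread = statistics.stdev(results) if n_runs > 1 else 0.0
    return statistics.mean(results), spread

def fake_run(rng):
    # Stand-in for one full run of the script (sample articles, fetch
    # pageviews, compute the coefficient). Returns a made-up value.
    return rng.uniform(0.1, 0.3)

mean_r, spread = average_over_runs(fake_run, n_runs=5)
print(f"mean r = {mean_r:.3f}, standard deviation = {spread:.3f}")
```

Reporting the spread alongside the mean shows how much the coefficient moves from one random sample to the next.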

Cheers,
Brian