Fwd: Quality and pageviews - Wiki-research-l

3 Jun 2010


      ---------- Forwarded message ----------
From: Brian Brian.Mingus@colorado.edu
Date: Wed, Jun 2, 2010 at 10:46 PM
Subject: Re: [Wiki-research-l] Quality and pageviews
To: Liam Wyatt liamwyatt@gmail.com
Interestingly, the result is negative. The correlation coefficient between
quality and page views, computed over 2500 featured articles and 2500 random
articles, is .18, which is very low. I also trained a linear classifier to
predict the quality of an article from its number of page views, and it was
no better than chance.
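
A minimal sketch of the kind of computation described above, assuming each featured article is labelled 1, each random article is labelled 0, and the coefficient is a plain Pearson correlation between label and page views. The function names and sample numbers are illustrative and are not taken from the original script. With SciPy installed, scipy.stats.pearsonr would give the same coefficient.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def quality_pageview_correlation(featured_views, random_views):
    """Correlate a 1/0 quality label with page views (assumed encoding)."""
    labels = [1] * len(featured_views) + [0] * len(random_views)
    views = list(featured_views) + list(random_views)
    return pearson(labels, views)


# Made-up page view counts for illustration only, not data from the thread.
featured = [1200, 300, 4500, 800]
random_sample = [900, 250, 5000, 600]
r = quality_pageview_correlation(featured, random_sample)
print(f"correlation between quality label and page views: {r:.3f}")
```

A value near 0 means the label says little about page views; a value near 1 or -1 would indicate a strong linear relationship.
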
This seems to make sense. People tend to spend a lot of time on articles
about their pet topics, which we all know is negatively correlated with the
historicity of the topic, so in many cases we end up with better articles on
Pokemon and the like than on important historical figures. This result could
probably be fleshed out further by looking at the correlation between the
Wikipedia Editorial Team's ratings of quality and importance.

It turns out that this is easy to do visually. Here is their summary table.
Many of the most important articles are of low quality, although the
converse is not really true. Thus there is some median threshold of
importance a topic must get over before someone will consider it a good pet
project. That's my take on your question. Cheers.

All rated articles by quality and importance
(rows: quality class; columns: importance)

Quality           Top     High      Mid      Low        ???      Total
FA                672      982      852      415        232      3,153
FL                101      393      332      401        168      1,395
A                  84      265      262      122         17        750
GA                801    1,900    3,112    2,554        912      9,279
B               7,206   15,246   23,455   17,358     12,252     75,517
C               3,023    8,779   18,292   18,278     13,137     61,509
Start           7,803   37,581  144,561  248,669    169,614    608,228
Stub            2,536   19,020  122,322  638,946    791,594  1,574,418
List            1,284    3,537    8,739   20,067     19,039     52,666
Assessed       23,510   87,703  321,927  946,810  1,006,965  2,386,915
Unassessed        260      988    3,264   15,253    362,711    382,476
Total          23,770   88,691  325,191  962,063  1,369,676  2,769,391

The row and column labels link to the corresponding categories, e.g.
http://en.wikipedia.org/wiki/Category:FA-Class_articles and
http://en.wikipedia.org/wiki/Category:Top-importance_articles
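
The claim that many of the most important articles are of low quality can be checked against the Top-importance column of the summary table. The sketch below copies that column's counts and computes two shares; treating Start and Stub as "low quality" and FA, FL, A and GA as "high quality" is an assumption made for illustration, not a grouping stated in the message.

```python
# Top-importance column of the summary table (assessed articles only).
top_importance = {
    "FA": 672, "FL": 101, "A": 84, "GA": 801,
    "B": 7206, "C": 3023, "Start": 7803, "Stub": 2536, "List": 1284,
}


def share(counts, classes):
    """Fraction of all counted articles that fall in the given classes."""
    total = sum(counts.values())
    return sum(counts[c] for c in classes) / total


# Assumed groupings, chosen for illustration.
low_share = share(top_importance, ["Start", "Stub"])
high_share = share(top_importance, ["FA", "FL", "A", "GA"])

print(f"Top-importance articles rated Start or Stub: {low_share:.1%}")
print(f"Top-importance articles rated GA or better:  {high_share:.1%}")
```

Under these groupings roughly 44% of assessed Top-importance articles are Start or Stub class, while about 7% are GA or better.
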
On Wed, Jun 2, 2010 at 9:36 AM, Brian Brian.Mingus@colorado.edu wrote:
...
Alright I will run it. It will probably be this evening before it is
finished. I'll let you know the results.
On Wed, Jun 2, 2010 at 9:30 AM, Liam Wyatt liamwyatt@gmail.com wrote:
...
I think you are greatly overestimating my technical abilities...
I don't even know where to find python and scipy and even if I did, I
wouldn't know what to do with them to make them "go". I'm a
historian-wikipedian, not a techie-wikipedian :-)
Nevertheless, I will link out to your script in the blogpost I'm currently
writing for those who may wish to run it themselves.
All the best,
-Liam
wittylama.com/blog
Peace, love & metadata
On 2 June 2010 16:24, Brian J Mingus Brian.Mingus@colorado.edu wrote:
...
Hey, all you have to do is install Python and SciPy, save the script to
somefilename.py, and then run python somefilename.py.
On Sat, May 29, 2010 at 5:17 PM, Liam Wyatt liamwyatt@gmail.com wrote:
...
Ok - your reasoning is sound. I'm happy that the "FA of the day" won't
distort the results :-) Thanks.
On another note however, now is the first time I've been able to look at
your link (before I was on my phone) and I must admit I have no idea what to
do with it. I'm no coder so I don't know how to go from
http://grey.colorado.edu/mingus/index.php/Wikipedia_Quality_Pageviews_Correl... to an actual result. Can you give me a hint?
-Liam
wittylama.com/blog
Peace, love & metadata
On 29 May 2010 16:50, Brian J Mingus Brian.Mingus@colorado.edu wrote:
...
I think this technique is perfectly scientific. If you sample 2000
featured and 2000 random articles, there is roughly a 1.5% chance (30/2000)
that a given featured article in the sample was on the front page last
month. If you like, you can make a list and exclude them, but the effect is
very small.
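
A rough sketch of the "make a list and exclude them" option, together with the 30/2000 arithmetic. It assumes about 30 front-page slots in the month (one featured article per day). The helper name and the placeholder titles are hypothetical and do not come from the original script.

```python
import random


def sample_excluding(candidates, excluded, k, seed=None):
    """Draw k titles at random from candidates, skipping anything in excluded."""
    blocked = set(excluded)
    eligible = [title for title in candidates if title not in blocked]
    if k > len(eligible):
        raise ValueError("not enough eligible articles to sample from")
    return random.Random(seed).sample(eligible, k)


# Back-of-the-envelope figure from the message: about 30 front-page
# slots in a month against a sample of 2000 featured articles.
front_page_slots = 30
sample_size = 2000
fraction = front_page_slots / sample_size  # 0.015, i.e. 1.5%

# Placeholder titles, purely illustrative.
featured_titles = [f"Featured article {i}" for i in range(100)]
recent_front_page = featured_titles[:5]
sample = sample_excluding(featured_titles, recent_front_page, k=20, seed=1)
print(f"{fraction:.1%} of the sample at most; drew {len(sample)} titles")
```
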
On Sat, May 29, 2010 at 10:24 AM, Brian Brian.Mingus@colorado.edu wrote:
...
I only looked at last month, so it is not sensitive to the article
being on the front page except by chance. Looking at Good articles is not a
good idea; you will get a very weak correlation.
On Sun, May 30, 2010 at 5:31 AM, Liam Wyatt liamwyatt@gmail.com wrote:
> Actually Brian, does your tool take account of the unusual spike in
> pageviews that a featured article receives when it's FA of the day? I'd like
> to be able to say that the article receives more views even without special
> events like that. Is it possible to remove that day for each of the FAs
> from the stats? Or, alternatively, can the script be run looking for Good
> Articles rather than FAs as these don't get special publicity but are still
> better quality than the average?
>
> -Liam
>
>
>
> On 29/05/2010, at 2:48, Brian J Mingus Brian.Mingus@Colorado.EDU
> wrote:
>
>
>
> On Fri, May 28, 2010 at 6:51 PM, Brian <Brian.Mingus@colorado.edu> wrote:
>
>>
>>
>> On Fri, May 28, 2010 at 5:31 PM, Liam Wyatt <liamwyatt@gmail.com> wrote:
>>
>>> Does anyone know of any research which demonstrates a correlation
>>> between the quality of an article in Wikipedia and the number of pageviews
>>> it receives?
>>> I'm trying to argue, as part of my work with museums, that if they
>>> want to get people to come to their own website from wikipedia, then instead
>>> of focusing on adding top-level links to the "external links" section, they
>>> should instead focus on adding deep-linked footnotes and encouraging the
>>> improvement of the quality of the article in general. This is on the basis
>>> that "increased quality=increased pageviews=increased clickthroughs". A
>>> win-win situation.
>>>
>>> I have some very nice statistics of pageviews (from
>>> http://stats.grok.se) that match the clickthrough
>>> stats from a museum (from their analytics) that clearly demonstrate a
>>> correlation between increased pageviews and increased clickthroughs.
>>> However, I'm still looking for some scientific data to prove the link
>>> between improved article quality and an increase in the number of views of
>>> that article.
>>>
>>> Can anyone help?
>>>
>>> -Liam
>>
>>
>> An easy way to pull this off looks like this:
>>
>> - Create a list of all the featured articles
>> - Use http://stats.grok.se to count their pageviews
>> - Generate a random list of articles
>> - Check their pageviews
>> - Compute correlation between pageviews / quality
>>
>> You can also do a slightly more detailed analysis by using the
>> Wikipedia Editorial Team's ratings, but caveat emptor, their rating system
>> is not satisfactory, as I've shown in a couple of papers. Nonetheless, if
>> you group together Featured/A/B articles and Start/Stub articles into two
>> groups, the result should be solid.
>>
>>
>>    - DeHoust, C., Mangalath, P., Mingus, B. (2008). *Improving
>>    search in Wikipedia through quality and concept discovery*.
>>    Technical Report. PDF: http://grey.colorado.edu/mediawiki/sites/mingus/images/6/68/DeHoustMangalathMingus08.pdf
>>    - Rassbach, L., Mingus, B., Blackford, T. (2007). *Exploring the
>>    feasibility of automatically rating online article quality*.
>>    Technical Report. PDF: http://grey.colorado.edu/mediawiki/sites/mingus/images/d/d3/RassbachPincockMingus07.pdf
>>
>>
> Liam,
>
> I've written you a quick bot that computes the correlation
> coefficient between quality and pageviews for random samples of featured and
> random articles in April. It requires Python and SciPy. You can find it
> here:
>
>
> http://grey.colorado.edu/mingus/index.php/Wikipedia_Quality_Pageviews_Correlation_Coefficient
>
> I suggest that you cross validate the result. In other words, run the
> script several times and then average the results. It's rather slow, owing
> to the slowness of Wikimedia properties. I personally didn't run it for more
> than 10 featured and 10 random articles, but I set those numbers to 2000 for
> you. It will take hours to run.
>
> Cheers,
> Brian
>
>
>
>
>
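
The suggestion in the last message to run the script several times and average the results could be sketched as follows. This is only an illustration: `run_once` stands in for one full sampling-and-correlation run, and the placeholder used in the demo returns random numbers rather than real correlations.

```python
import random
import statistics


def average_over_runs(run_once, n_runs, seed=0):
    """Call run_once(rng) n_runs times; return (mean, sample std dev)."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    rng = random.Random(seed)
    results = [run_once(rng) for _ in range(n_runs)]
    mean = statistics.fmean(results)
    # The spread across runs shows how much one run can be trusted.
    spread = statistics.stdev(results) if n_runs > 1 else 0.0
    return mean, spread


def placeholder_run(rng):
    """Stand-in for one run of the real script; returns a random number."""
    return rng.uniform(-1.0, 1.0)


mean_r, spread_r = average_over_runs(placeholder_run, n_runs=5)
print(f"mean over runs: {mean_r:.3f}, spread: {spread_r:.3f}")
```

Reporting the spread alongside the mean makes it easier to judge whether a low coefficient is stable across samples or an accident of one draw.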