---------- Forwarded message ----------
From: Brian <Brian.Mingus(a)colorado.edu>
Date: Wed, Jun 2, 2010 at 10:46 PM
Subject: Re: [Wiki-research-l] Quality and pageviews
To: Liam Wyatt <liamwyatt(a)gmail.com>
Interestingly, the result is negative. The correlation coefficient between
quality and pageviews, computed over 2500 featured articles and 2500 random
articles, is .18, which is very low. I also trained a linear classifier to
predict the quality of an article based on its number of page views, and it
was no better than chance.
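For readers who want to see what such a coefficient measures, here is a minimal sketch, not the script referred to in this thread. It assumes quality is coded as 1 for a featured article and 0 for a random one, so the Pearson coefficient against pageviews is a point-biserial correlation. The pageview counts and the names featured_views and random_views are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical monthly pageview counts, invented for illustration only.
featured_views = [12000, 800, 45000, 3100, 560]
random_views = [9000, 150, 30000, 2200, 75]

# Quality label: 1 = featured article, 0 = random article (an assumption).
quality = [1] * len(featured_views) + [0] * len(random_views)
views = featured_views + random_views

r = pearson(quality, views)
print(f"r = {r:.2f}")
```

A value near 0 means pageviews carry little information about the quality label, which is the sense in which .18 is described as very low.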
This seems to make sense. People tend to spend a lot of time on articles
about their pet topics, and that effort, as we all know, is negatively
correlated with the historicity of the topic. In many cases this leads to
better Pokemon articles, etc., than we have on important historical figures.
This result could probably be fleshed out further by looking at the
correlation between the Wikipedia Editorial Team's ratings of quality and
importance.
It turns out that this is easy to do visually. Here is their summary table.
Many of the most important articles are of low quality, although the
converse is not really true. Thus there is some median threshold of
importance a topic must get over before someone will consider it a good pet
project. That's my take on your question. Cheers.
All rated articles by quality and importance

                             Importance
Quality        Top    High      Mid      Low        ???      Total
FA             672     982      852      415        232      3,153
FL             101     393      332      401        168      1,395
A               84     265      262      122         17        750
GA             801   1,900    3,112    2,554        912      9,279
B            7,206  15,246   23,455   17,358     12,252     75,517
C            3,023   8,779   18,292   18,278     13,137     61,509
Start        7,803  37,581  144,561  248,669    169,614    608,228
Stub         2,536  19,020  122,322  638,946    791,594  1,574,418
List         1,284   3,537    8,739   20,067     19,039     52,666
Assessed    23,510  87,703  321,927  946,810  1,006,965  2,386,915
Unassessed     260     988    3,264   15,253    362,711    382,476
Total       23,770  88,691  325,191  962,063  1,369,676  2,769,391

Category links:
Top <http://en.wikipedia.org/wiki/Category:Top-importance_articles>
High <http://en.wikipedia.org/wiki/Category:High-importance_articles>
Mid <http://en.wikipedia.org/wiki/Category:Mid-importance_articles>
Low <http://en.wikipedia.org/wiki/Category:Low-importance_articles>
??? <http://en.wikipedia.org/wiki/Category:Unassessed-Class_articles>
FA <http://en.wikipedia.org/wiki/Category:FA-Class_articles>
FL <http://en.wikipedia.org/wiki/Category:FL-Class_articles>
A <http://en.wikipedia.org/wiki/Category:A-Class_articles>
GA <http://en.wikipedia.org/wiki/Category:GA-Class_articles>
B <http://en.wikipedia.org/wiki/Category:B-Class_articles>
C <http://en.wikipedia.org/wiki/Category:C-Class_articles>
Start <http://en.wikipedia.org/wiki/Category:Start-Class_articles>
Stub <http://en.wikipedia.org/wiki/Category:Stub-Class_articles>
List <http://en.wikipedia.org/wiki/Category:List-Class_articles>
Unassessed <http://en.wikipedia.org/wiki/Category:Unassessed-Class_articles>
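The visual reading of the summary table can be checked with a little arithmetic. The sketch below is only an illustration: the counts are read from the table, with each row split so that it sums to its stated total, and the helper names (column_total, share_of_column, share_of_row) are hypothetical. Treating Start and Stub as "low quality" is an assumption made for this example.

```python
# Counts read from the summary table: quality class -> (Top, High, Mid, Low, ???)
IMPORTANCE = ("Top", "High", "Mid", "Low", "???")
counts = {
    "FA": (672, 982, 852, 415, 232),
    "FL": (101, 393, 332, 401, 168),
    "A": (84, 265, 262, 122, 17),
    "GA": (801, 1900, 3112, 2554, 912),
    "B": (7206, 15246, 23455, 17358, 12252),
    "C": (3023, 8779, 18292, 18278, 13137),
    "Start": (7803, 37581, 144561, 248669, 169614),
    "Stub": (2536, 19020, 122322, 638946, 791594),
    "List": (1284, 3537, 8739, 20067, 19039),
}


def column_total(level):
    """Number of assessed articles at one importance level."""
    i = IMPORTANCE.index(level)
    return sum(row[i] for row in counts.values())


def share_of_column(classes, level):
    """Fraction of articles at an importance level that fall in the given classes."""
    i = IMPORTANCE.index(level)
    return sum(counts[c][i] for c in classes) / column_total(level)


def share_of_row(quality, levels):
    """Fraction of articles in a quality class that fall in the given levels."""
    row = counts[quality]
    return sum(row[IMPORTANCE.index(lv)] for lv in levels) / sum(row)


# Assumption for this example: "low quality" means Start or Stub class.
top_low_quality = share_of_column(("Start", "Stub"), "Top")
fa_low_importance = share_of_row("FA", ("Low",))

print(f"Top-importance articles rated Start or Stub: {top_low_quality:.1%}")
print(f"FA-class articles rated Low importance: {fa_low_importance:.1%}")
```

The first share is computed over assessed Top-importance articles only; the second is computed over all FA-class articles, including those with unknown importance.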
On Wed, Jun 2, 2010 at 9:36 AM, Brian <Brian.Mingus(a)colorado.edu> wrote:
Alright, I will run it. It will probably be this evening before it is
finished. I'll let you know the results.
On Wed, Jun 2, 2010 at 9:30 AM, Liam Wyatt <liamwyatt(a)gmail.com> wrote:
I think you are greatly overestimating my technical abilities...
I don't even know where to find Python and SciPy and even if I did, I
wouldn't know what to do with them to make them "go". I'm a
historian-wikipedian, not a techie-wikipedian :-)
Nevertheless, I will link out to your script in the blogpost I'm currently
writing for those who may wish to run it themselves.
All the best,
-Liam
wittylama.com/blog
Peace, love & metadata
On 2 June 2010 16:24, Brian J Mingus <Brian.Mingus(a)colorado.edu> wrote:
Hey, all you have to do is install Python and SciPy, save the script to
somefilename.py, and then run python somefilename.py.
On Sat, May 29, 2010 at 5:17 PM, Liam Wyatt <liamwyatt(a)gmail.com> wrote:
Ok - your reasoning is sound. I'm happy that the "FA of the day" won't
distort the results :-) Thanks.
On another note however, now is the first time I've been able to look at
your link (before I was on my phone) and I must admit I have no idea what to
do with it. I'm no coder so I don't know how to go from
http://grey.colorado.edu/mingus/index.php/Wikipedia_Quality_Pageviews_Corre…
to an actual result... Can you give me a hint?
-Liam
wittylama.com/blog
Peace, love & metadata
On 29 May 2010 16:50, Brian J Mingus <Brian.Mingus(a)colorado.edu> wrote:
> I think this technique is perfectly scientific. If you sample 2000
> featured and 2000 random articles there is a 1.5% chance (30/2000) that the
> featured article was on the front page last month. If you like, you can make
> a list and exclude them. But the effect is very small.
>
>
> On Sat, May 29, 2010 at 10:24 AM, Brian <Brian.Mingus(a)colorado.edu> wrote:
>
>> I only looked at last month so it is not sensitive to the article
>> being on the front page except by chance. Looking at Good articles is not a
>> good idea, you will get a very weak correlation.
>>
>>
>> On Sun, May 30, 2010 at 5:31 AM, Liam Wyatt <liamwyatt(a)gmail.com> wrote:
>>
>>> Actually Brian, does your tool take account of the unusual spike in
>>> pageviews that a featured article receives when it's FA of the day? I'd like
>>> to be able to say that the article receives more views even without special
>>> events like that. Is it possible to remove that day for each of the FAs
>>> from the stats? Or, alternatively, can the script be run looking for Good
>>> Articles rather than FAs, as these don't get special publicity but are still
>>> better quality than the average?
>>>
>>> -Liam
>>>
>>>
>>>
>>> On 29/05/2010, at 2:48, Brian J Mingus <Brian.Mingus(a)Colorado.EDU>
>>> wrote:
>>>
>>>
>>>
>>> On Fri, May 28, 2010 at 6:51 PM, Brian <Brian.Mingus(a)colorado.edu> wrote:
>>>
>>>>
>>>>
>>>> On Fri, May 28, 2010 at 5:31 PM, Liam Wyatt <liamwyatt(a)gmail.com> wrote:
>>>>
>>>>> Does anyone know of any research which demonstrates a correlation
>>>>> between the quality of an article in Wikipedia and the number of pageviews
>>>>> it receives?
>>>>> I'm trying to argue, as part of my work with museums, that if they
>>>>> want to get people to come to their own website from Wikipedia, then
>>>>> instead of focusing on adding top-level links to the "external links"
>>>>> section, they should focus on adding deep-linked footnotes and encouraging
>>>>> the improvement of the quality of the article in general. This is on the
>>>>> basis that "increased quality=increased pageviews=increased clickthroughs".
>>>>> A win-win situation.
>>>>>
>>>>> I have some very nice statistics of pageviews (from
>>>>> http://stats.grok.se) that match the clickthrough
>>>>> stats from a museum (from their analytics) that clearly demonstrate a
>>>>> correlation between increased pageviews and increased clickthroughs.
>>>>> However, I'm still looking for some scientific data to prove the link
>>>>> between improved article quality and an increase in the number of views of
>>>>> that article.
>>>>>
>>>>> Can anyone help?
>>>>>
>>>>> -Liam
>>>>
>>>>
>>>> An easy way to pull this off looks like this:
>>>>
>>>> - Create a list of all the featured articles
>>>> - Use http://stats.grok.se to count their pageviews
>>>> - Generate a random list of articles
>>>> - Check their pageviews
>>>> - Compute correlation between pageviews / quality
>>>>
>>>> You can also do a slightly more detailed analysis by using the
>>>> Wikipedia Editorial Team's ratings, but caveat emptor: their rating system
>>>> is not satisfactory, as I've shown in a couple of papers. Nonetheless, if
>>>> you group Featured/A/B articles and Start/Stub articles into two
>>>> groups, the result should be solid.
>>>>
>>>>
>>>> - DeHoust, C., Mangalath, P., Mingus, B. (2008). *Improving
>>>> search in Wikipedia through quality and concept discovery*.
>>>> Technical Report. PDF<http://grey.colorado.edu/mediawiki/sites/mingus/images/6/68/DeHoustM…
>>>> - Rassbach, L., Mingus, B., Blackford, T. (2007). *Exploring the
>>>> feasibility of automatically rating online article quality*.
>>>> Technical Report. PDF<http://grey.colorado.edu/mediawiki/sites/mingus/images/d/d3/Rassbach…
>>>>
>>>>
>>> Liam,
>>>
>>> I've written you a quick bot that computes the correlation
>>> coefficient between quality and pageviews for random samples of featured and
>>> random articles in April. It requires Python and SciPy. You can find it
>>> here:
>>>
>>>
>>>
>>> http://grey.colorado.edu/mingus/index.php/Wikipedia_Quality_Pageviews_Correlation_Coefficient
>>>
>>> I suggest that you cross-validate the result. In other words, run the
>>> script several times and then average the results. It's rather slow, owing
>>> to the slowness of Wikimedia properties. I personally didn't run it for more
>>> than 10 featured and 10 random articles, but I set those numbers to 2000 for
>>> you. It will take hours to run.
>>>
>>> Cheers,
>>> Brian