On Thu, Jun 3, 2010 at 4:14 PM, Reid Priedhorsky reid@umn.edu wrote:
Brian J Mingus wrote:
---------- Forwarded message ---------- From: Brian Brian.Mingus@colorado.edu Date: Wed, Jun 2, 2010 at 10:46 PM Subject: Re: [Wiki-research-l] Quality and pageviews To: Liam Wyatt liamwyatt@gmail.com
Interestingly, the result is negative. The correlation coefficient between 2500 featured articles and 2500 random articles is .18, which is very low. I also trained a linear classifier to predict the quality of an article based on the number of page views, and it was no better than chance.
That reminds me of an incidental finding from our 2007 work: we wanted to use article edit rate to predict view rate, but there was no correlation between the two.
Reid
That is an interesting negative finding as well. Just so this thread doesn't go without some positive results, here is a table from one of my technical reports on some features that *do* correlate with quality. If a number is greater than zero the feature correlates with quality, if it is 0 it does not correlate, and if it is less than 0 it is negatively correlated with quality. The absolute scale of the numbers is meaningless and not interpretable, although the relative magnitudes are important. They represent the relative performance of each feature for each class, as extracted from the weights of a random forest classifier.
http://grey.colorado.edu/mediawiki/sites/mingus/images/1/1e/DeHoustMangalath...
Summary (features in order of predictive ability):
- *Featured* articles are *correlated* with: Number of images, Number of external links, Automated Readability Index, Number of references, Number of internal links, Length of article HTML, Gunning Fog Index, Flesch-Kincaid Grade Level, Laesbarhedsindex Readability Formula, Number of words, Number of to be's, Number of sentences.
  - Note that featured articles are easy to predict.
- *A* articles are *correlated* with: Number of references, PageRank, Number of external links, Number of images, Article age (page-id).
  - Note that A articles are extremely hard to predict. All of the above A predictors are weaker than all of the featured predictors. This class should be merged with another quality class.
- *G* articles are *correlated* with: Number of external links, Number of templates, Number of references, Automated Readability Index, Flesch-Kincaid Grade Level.
- *G* articles are *negatively correlated* with: Length of article HTML, Flesch Reading Ease, SMOG Grading.
  - Note that G articles are extremely hard to predict and should be merged with another quality class.
- *B* articles are *correlated* with: Automated Readability Index, Flesch-Kincaid Grade Level, Laesbarhedsindex Readability Formula, Gunning Fog Index, Length of article HTML, Number of paragraphs, Flesch Reading Ease, SMOG Grading, Number of internal links, Number of words, Number of references, Number of to be's, Number of sentences, Coleman-Liau Index, Number of templates, PageRank, Number of external links, Number of relative links, Number of <h3>s, Number of interlanguage links.
  - Note that B articles are very easy to predict.
- *Start/Stub* articles were left out of this analysis because they are so easy to predict based on the lack of pretty much any useful information.
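For anyone who wants to reproduce this kind of ranking, here is a minimal sketch (not the report's actual pipeline) of pulling relative feature weights out of a random forest with scikit-learn. The feature names and data below are placeholders, and note that scikit-learn's impurity-based importances are unsigned, so they give only a relative ranking rather than the signed per-class weights in the table above.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Illustrative feature names only; the real report uses many more.
    feature_names = [
        "num_images", "num_external_links", "automated_readability_index",
        "num_references", "num_internal_links", "html_length",
    ]

    # Placeholder data standing in for the real article/feature matrix:
    # X is (articles x features), y holds the quality-class labels.
    rng = np.random.default_rng(0)
    X = rng.random((5000, len(feature_names)))
    y = rng.choice(["Featured", "A", "GA", "B"], size=5000)

    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(X, y)

    # As noted above, only the relative magnitudes are meaningful.
    for name, weight in sorted(zip(feature_names, rf.feature_importances_),
                               key=lambda pair: -pair[1]):
        print(f"{name}: {weight:.3f}")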
On Fri, Jun 4, 2010 at 8:16 AM, Brian Brian.Mingus@colorado.edu wrote:
Single best predictor overall: Automated Readability Index http://en.wikipedia.org/wiki/Automated_Readability_Index
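For reference, the index is easy to compute; here is a minimal sketch based on the formula from that Wikipedia page (ARI = 4.71*(characters/words) + 0.5*(words/sentences) - 21.43), with deliberately naive word and sentence splitting:

    import re

    def automated_readability_index(text: str) -> float:
        # Naive tokenization, purely for illustration.
        words = re.findall(r"[A-Za-z0-9']+", text)
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        chars = sum(len(w) for w in words)
        return (4.71 * (chars / len(words))
                + 0.5 * (len(words) / len(sentences))
                - 21.43)

    print(automated_readability_index(
        "The quick brown fox jumps over the lazy dog. It was not amused."))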
On Friday, June 04, 2010, Brian J Mingus wrote:
Single best predictor overall: Automated Readability Index http://en.wikipedia.org/wiki/Automated_Readability_Index
And that's intuitive, as one of the more difficult human tasks is to write something clearly and concisely.
Brian J Mingus, 04/06/2010 16:17:
o Note that G articles are extremely hard to predict and should be merged with another quality class.
Or, vice versa, is it a useful class precisely because of that, i.e. because it gives information that an automated algorithm can't give?
Nemo
On Sat, Jun 5, 2010 at 3:28 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Do you have an example of the information that the raters use to classify good articles that I didn't look at? If so, we can automate it and try to classify them. There really is nothing a human can do that we can't automate in some way. A few things are out of reach right now, such as dependency parsing and discourse analysis, including differentiating clear and grammatically correct prose from brilliant prose. Those things are on the horizon, however, and might even be possible already. The bottom line is that the raters aren't using any consistent methodology to classify the articles; they don't even follow their own guidelines or the Wikipedia Manual of Style.
Cheers,
Brian J Mingus wrote:
o Note that featured articles are easy to predict.
* *A* articles are /correlated/ with Number of references, PageRank, Number of external links, Number of images, Article age (page-id).
o Note that A articles are extremely hard to predict. All of the above A predictors are weaker than all of the featured predictors. This class should be merged with another quality class.
I wonder if this is because most projects don't have a proper A-class grading scheme, and editors will often assign the A grade to a GA-class (or even pre-GA-class) article without much thought.
I wonder if your results would be stronger if you limited the analysis to A-class articles of projects that have an A-class review process (MILHIST and...?).
* *G* articles are /correlated/ with Number of external links, Number of templates, Number of references, Automated Readability Index, Flesch-Kincaid Grade Level
* *G* articles are /negatively correlated/ with Length of article HTML, Flesch Reading Ease, SMOG Grading
o Note that G articles are extremely hard to predict and should be merged with another quality class.
Interesting. I presume you mean the Good Article (GA) class here. I wonder why those results are weak - is it because Good Article reviewers' standards vary much more widely than Featured Article reviewers' standards?
On Sat, Jun 5, 2010 at 12:16 AM, Brian J Mingus Brian.Mingus@colorado.edu wrote:
http://grey.colorado.edu/mediawiki/sites/mingus/images/1/1e/DeHoustMangalath...
Are you able to add 'no. of incoming internal links'?
-- John Vandenberg