On Thu, Jun 3, 2010 at 4:14 PM, Reid Priedhorsky reid@umn.edu wrote:
Brian J Mingus wrote:
---------- Forwarded message ---------- From: Brian Brian.Mingus@colorado.edu Date: Wed, Jun 2, 2010 at 10:46 PM Subject: Re: [Wiki-research-l] Quality and pageviews To: Liam Wyatt liamwyatt@gmail.com
Interestingly, the result is negative. The correlation coefficient between 2500 featured articles and 2500 random articles is .18, which is very low. I also trained a linear classifier to predict the quality of an article based on the number of page views, and it was no better than chance.
That reminds me of an incidental finding from our 2007 work: we wanted to use article edit rate to predict view rate, but there was no correlation between the two.
Reid
That is an interesting negative finding as well. Just so this thread doesn't go without some positive results, here is a table from one of my technical reports on some features that *do* correlate with quality. If a number is greater than zero the feature correlates with quality, if it is 0 it does not correlate, and if it is less than 0 it is negatively correlated with quality. The absolute scale of the numbers is meaningless and not interpretable, although the relative magnitudes are important. They represent the relative performance of each feature for each class, as extracted from the weights of a random forest classifier.
http://grey.colorado.edu/mediawiki/sites/mingus/images/1/1e/DeHoustMangalath...
Summary (features in order of predictive ability):
- *Featured* articles are *correlated* with: Number of images, Number of external links, Automated Readability Index, Number of references, Number of internal links, Length of article HTML, Gunning Fog Index, Flesch-Kincaid Grade Level, Laesbarhedsindex Readability Formula, Number of words, Number of to be's, Number of sentences.
  - Note that featured articles are easy to predict.
- *A* articles are *correlated* with: Number of references, PageRank, Number of external links, Number of images, Article age (page-id).
  - Note that A articles are extremely hard to predict. All of the above A predictors are weaker than all of the featured predictors. This class should be merged with another quality class.
- *G* articles are *correlated* with: Number of external links, Number of templates, Number of references, Automated Readability Index, Flesch-Kincaid Grade Level.
- *G* articles are *negatively correlated* with: Length of article HTML, Flesch Reading Ease, SMOG Grading.
  - Note that G articles are extremely hard to predict and should be merged with another quality class.
- *B* articles are *correlated* with: Automated Readability Index, Flesch-Kincaid Grade Level, Laesbarhedsindex Readability Formula, Gunning Fog Index, Length of article HTML, Number of paragraphs, Flesch Reading Ease, SMOG Grading, Number of internal links, Number of words, Number of references, Number of to be's, Number of sentences, Coleman-Liau Index, Number of templates, PageRank, Number of external links, Number of relative links, Number of <h3>s, Number of interlanguage links.
  - Note that B articles are very easy to predict.
- *Start/Stub* articles were left out of this analysis because they are so easy to predict based on the lack of pretty much any useful information.
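For anyone who wants to reproduce this kind of ranking, here is a minimal sketch (not the report's actual pipeline) of pulling relative feature weights out of a random forest with scikit-learn. The feature names and data below are placeholders, and note that scikit-learn's impurity-based importances are unsigned, so they give only a relative ranking rather than the signed per-class weights in the table above.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Illustrative feature names only; the real report uses many more.
    feature_names = [
        "num_images", "num_external_links", "automated_readability_index",
        "num_references", "num_internal_links", "html_length",
    ]

    # Placeholder data standing in for the real article/feature matrix:
    # X is (articles x features), y holds the quality-class labels.
    rng = np.random.default_rng(0)
    X = rng.random((5000, len(feature_names)))
    y = rng.choice(["Featured", "A", "GA", "B"], size=5000)

    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(X, y)

    # As noted above, only the relative magnitudes are meaningful.
    for name, weight in sorted(zip(feature_names, rf.feature_importances_),
                               key=lambda pair: -pair[1]):
        print(f"{name}: {weight:.3f}")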
On Fri, Jun 4, 2010 at 8:16 AM, Brian Brian.Mingus@colorado.edu wrote:
Single best predictor overall: Automated Readability Index http://en.wikipedia.org/wiki/Automated_Readability_Index
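For reference, the index is easy to compute; here is a minimal sketch based on the formula from that Wikipedia page (ARI = 4.71*(characters/words) + 0.5*(words/sentences) - 21.43), with deliberately naive word and sentence splitting:

    import re

    def automated_readability_index(text: str) -> float:
        # Naive tokenization, purely for illustration.
        words = re.findall(r"[A-Za-z0-9']+", text)
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        chars = sum(len(w) for w in words)
        return (4.71 * (chars / len(words))
                + 0.5 * (len(words) / len(sentences))
                - 21.43)

    print(automated_readability_index(
        "The quick brown fox jumps over the lazy dog. It was not amused."))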
On Friday, June 04, 2010, Brian J Mingus wrote:
Single best predictor overall: Automated Readability Index http://en.wikipedia.org/wiki/Automated_Readability_Index
And that's intuitive, as one of the more difficult human tasks is to write something clearly and concisely.
Brian J Mingus, 04/06/2010 16:17:
o Note that G articles are extremely hard to predict and should be merged with another quality class.
Or, vice versa, is it a useful class precisely because of that, i.e. because it gives information that an automated algorithm can't give?
Nemo
On Sat, Jun 5, 2010 at 3:28 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Do you have an example of the information that the raters use to classify good articles that I didn't look at? If so, we can automate it and try to classify them. There really is nothing a human can do that we can't automate in some way. A few things are out of reach right now, such as dependency parsing and discourse analysis, including differentiating clear and grammatically correct prose from brilliant prose. Those things are on the horizon, however, and might even be possible already. The bottom line is that the raters aren't using any consistent methodology to classify the articles; they don't even follow their own guidelines or the Wikipedia Manual of Style.
Cheers,
Brian J Mingus wrote:
o Note that featured articles are easy to predict.
* *A* articles are /correlated/ with Number of references, PageRank, Number of external links, Number of images, Article age (page-id).
o Note that A articles are extremely hard to predict. All of the above A predictors are weaker than all of the featured predictors. This class should be merged with another quality class.
I wonder if this is because most projects don't have a proper A-class grading scheme, and editors will often assign the A grade to a GA-class (or even pre-GA-class) article without much thought.
I wonder if your results would be stronger if you limited the analysis to A-class articles of projects that have an A-class review process (MILHIST and...?).
* *G* articles are /correlated/ with Number of external links, Number of templates, Number of references, Automated Readability Index, Flesch-Kincaid Grade Level
* *G* articles are /negatively correlated/ with Length of article HTML, Flesch Reading Ease, SMOG Grading
o Note that G articles are extremely hard to predict and should be merged with another quality class.
Interesting. I presume you mean the Good Article (GA) class here. I wonder why those results are weak - is it because Good Article reviewers' standards vary much more widely than Featured Article reviewers' standards?
On Sat, Jun 5, 2010 at 12:16 AM, Brian J Mingus Brian.Mingus@colorado.edu wrote:
http://grey.colorado.edu/mediawiki/sites/mingus/images/1/1e/DeHoustMangalath...
Are you able to add 'no. of incoming internal links'?
-- John Vandenberg