Re: [Wikiquality-l] Using PageRank to ascertain quality (Foundation help needed!)

9 Nov 2007

Thanks for reading it. Articles that have high quality by the measures used
in that paper tend to score high across all dimensions of quality. These
dimensions are correlated not only with articles that are well written, but
also with articles that are correct. I'm not sure there is a good argument
against the point that Featured articles will tend to be more correct than A
articles, which will tend to be more correct than Good articles, etc. That
comment was made as more of an aside in the conclusion. These machine
learning algorithms aren't doing much more than searching for correlations.
Thus, they are no better suited to finding poorly written articles than to
finding correct articles, or any other thing you can imagine. They do not
discover causation. They do "reverse engineer" the human ratings, in the
sense that they find features that correlate with them. Correctness likely
correlates with quality, and the number of references, which is a feature we
included, likely correlates with correctness.

The distribution of the tags is skewed towards Start articles. If you train
a classifier on an un-normalized dataset, it will do the intelligent thing:
classify all articles as Start. Click the "Random page" link a couple of
dozen times and you can see that this is indeed a good way to get roughly
70% of the classifications correct. However, we removed the skew from our
dataset by using equal numbers of all classes, based on the number of A
articles, since that is the rarest class in the encyclopedia. Thus, we
trained on 650 articles of each class, and from this extremely limited
dataset, achieved decent performance.
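
To illustrate the balancing step described above, here is a minimal Python
sketch. It is not the code used for the paper; the function names and the toy
label counts are made up for illustration, and it assumes each article is
simply an (id, label) pair.

```python
import random
from collections import Counter, defaultdict


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def balance_by_undersampling(examples, seed=0):
    """Keep an equal number of examples per class, sized to the rarest class.

    `examples` is a list of (article_id, label) pairs.
    """
    by_class = defaultdict(list)
    for article_id, label in examples:
        by_class[label].append((article_id, label))

    # The rarest class sets the per-class sample size.
    n_per_class = min(len(items) for items in by_class.values())

    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_class):
        balanced.extend(rng.sample(by_class[label], n_per_class))
    rng.shuffle(balanced)
    return balanced


# Hypothetical, made-up label distribution (not real Wikipedia counts).
toy = (
    [(i, "Start") for i in range(70)]
    + [(70 + i, "B") for i in range(20)]
    + [(90 + i, "A") for i in range(10)]
)
balanced = balance_by_undersampling(toy)

skewed_baseline = majority_baseline_accuracy([label for _, label in toy])
balanced_baseline = majority_baseline_accuracy([label for _, label in balanced])
```

On the skewed toy set, always guessing "Start" already looks good; on the
balanced set the same trick drops to chance, so any accuracy above that has to
come from the features.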

Of course, this was only a class project, intended to be a proof of concept.
It is well known that Support Vector Machine classification consistently
outperforms other methods in the domain of text classification, and if we
were only interested in high numbers, we could have boosted them that way.

Cheers,
Brian

On Nov 8, 2007 5:27 PM, P. Birken <pbirken(a)gmail.com> wrote:

...
 Erik should be able to help you. I read your paper and your
 conclusions and you might think about rewriting them. In particular,
 correctness is not and cannot be evaluated by your method, which
 therefore cannot point readers to articles that are most likely
 correct, only to articles that are well written. Your measure of
 accuracy of your method is also a bit dubious, since the tags are not
 uniform (take two featured articles of different age and they will be
 of very different quality) and recovering them to 100% is therefore
 not a reasonable goal. However, I believe that your method is a
 reasonable way to find articles that are badly written.

 Bye,

 Philipp

 2007/11/8, Brian <Brian.Mingus(a)colorado.edu>:
 Several collaborators and I are preparing to expand on previous work to
 automatically ascertain the quality of Wikipedia articles on the English
 Wikipedia (presented at Wikimania '07 [0]). PageRank is Google's hallmark
 quality metric, and the foundation actually has access to these numbers
 through the Google Webmaster Tools website. If a foundation representative
 were to create a Google account and verify that they were a "webmaster,"
 they could download the PageRank for every article on the English Wikipedia
 in a convenient tabular format. This data would likely serve as a fantastic
 predictor. I would also like to compare the Google-computed PageRank to the
 PageRank computed via Wikipedia's internal link structure. I don't see any
 privacy implications in releasing this data. It also doesn't seem to help
 spammers much, as they already know the pages that have a very high
 PageRank, and we include rel="nofollow" on outbound links. Nonetheless, I
 would of course be willing to keep the data private.

 This would only take a few minutes if it were approved. Is anyone out there
 who has the power to make it happen?

 Cheers :)
 Brian

 [0] http://upload.wikimedia.org/wikipedia/wikimania2007/d/d3/RassbachPincockMin…

 _______________________________________________
 Wikiquality-l mailing list
 Wikiquality-l(a)lists.wikimedia.org
 http://lists.wikimedia.org/mailman/listinfo/wikiquality-l
