Thanks for reading it. Articles that have high quality by the measures used in that paper tend to score high on all dimensions of quality. These dimensions correlate not only with how well written an article is, but also with how correct it is. I'm not sure there is a good argument against the point that Featured articles will tend to be more correct than A articles, which will tend to be more correct than Good articles, etc... That comment was made as more of an aside in the conclusion. These machine learning algorithms aren't doing much more than searching for correlations. Thus, they are no better at finding poorly written articles than at finding correct ones, or anything else you can imagine that correlates with the ratings. They do not discover causation. They do "reverse engineer" the human ratings, in the sense that they find features that correlate with them. Correctness likely correlates with quality, and the number of references, which is a feature we included, likely correlates with correctness.

The distribution of the tags is skewed towards Start articles. If you train a classifier on an un-normalized dataset, it will do the intelligent thing: classify all articles as Start. Click the "Random page" link a couple of dozen times and you can see that this is indeed a good way to get roughly 70% of the classifications correct. However, we removed the skew from our dataset by using equal numbers of articles from every class. The number was set by the A class, since A articles are the rarest in the encyclopedia. Thus, we trained on 650 articles of each class, and from this extremely limited dataset we achieved decent performance.
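
To make the balancing step concrete, here is a minimal sketch of that kind of undersampling. It is not the code from the project: the function names, the toy labels, and the toy class proportions are all made up for illustration, and it only assumes each article comes as an (id, label) pair.

```python
import random
from collections import Counter, defaultdict


def balance_by_undersampling(articles, seed=0):
    """Return a class-balanced subset of (article_id, label) pairs.

    Every class is cut down to the size of the rarest class,
    using a seeded random sample so the result is repeatable.
    """
    by_label = defaultdict(list)
    for article_id, label in articles:
        by_label[label].append(article_id)

    # The rarest class decides how many articles each class keeps.
    n_per_class = min(len(ids) for ids in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for label, ids in sorted(by_label.items()):
        for article_id in rng.sample(ids, n_per_class):
            balanced.append((article_id, label))
    rng.shuffle(balanced)
    return balanced


def majority_baseline_accuracy(articles):
    """Accuracy of always predicting the most common label."""
    counts = Counter(label for _, label in articles)
    return max(counts.values()) / len(articles)


# Toy data with invented proportions (hypothetical, not real counts).
toy = (
    [(f"start-{i}", "Start") for i in range(70)]
    + [(f"b-{i}", "B") for i in range(15)]
    + [(f"good-{i}", "Good") for i in range(10)]
    + [(f"a-{i}", "A") for i in range(5)]
)
balanced_toy = balance_by_undersampling(toy)

print(majority_baseline_accuracy(toy))           # skewed data
print(majority_baseline_accuracy(balanced_toy))  # balanced data
```

On the skewed toy set, always guessing the majority class looks good; on the balanced set the same guess does no better than chance, so any accuracy above that has to come from the features.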

Of course, this was only a class project, intended to be a proof of concept. It is well known that Support Vector Machine classification consistently outperforms other methods in the domain of text classification, and if we were only interested in high numbers, we could have boosted them that way.


On Nov 8, 2007 5:27 PM, P. Birken <> wrote:
Erik should be able to help you. I read your paper and your
conclusions and you might think about rewriting them. In particular,
correctness is not and cannot be evaluated by your method, which
therefore cannot point readers to articles that are most likely
correct, simply to articles that are well written. Your measure of
accuracy of your method is also a bit dubious, since the tags are not
uniform (take two featured articles of different age and they will be
of very different quality) and recovering them to 100% is therefore
not a reasonable goal. However, I believe that your method is
reasonable to find articles that are badly written.



2007/11/8, Brian < >:
> Several collaborators and I are preparing to expand on previous work to
> automatically ascertain the quality of Wikipedia articles on the English
> Wikipedia (presented at Wikimania '07 [0]). PageRank is Google's hallmark
> quality metric, and the foundation actually has access to these numbers
> through the Google Webmaster Tools website. If a foundation representative
> were to create a Google account and verify that they were a "webmaster,"
> they could download the PageRank for every article on the English Wikipedia
> in a convenient tabular format. This data would likely serve as a fantastic
> predictor. I would also like to compare the Google-computed PageRank to the
> PageRank computed via Wikipedia's internal link structure. I don't see any
> privacy implications in releasing this data. It also doesn't seem to help
> spammers much, as they already know the pages that have a very high
> PageRank, and we include rel="nofollow" on outbound links. Nonetheless, I
> would of course be willing to keep the data private.
> This would only take a few minutes if it were approved. Is anyone out there
> who has the power to make it happen?
> Cheers :)
> Brian
> [0]
> _______________________________________________
> Wikiquality-l mailing list