Thanks for reading it. Articles that are high quality by the measures used
in that paper tend to score high across all dimensions of quality. These
dimensions are correlated not only with articles being well written, but
also with articles being correct. I'm not sure there is a good argument
against the point that Featured articles will tend to be more correct than
A articles, which will tend to be more correct than Good articles, and so
on. That comment was made as more of an aside in the conclusion. These
machine learning algorithms aren't doing much more than searching for
correlations. Thus, they are no better at finding poorly written articles
than at finding correct articles, or anything else you can imagine. They do
not discover causation. They do "reverse engineer" the human ratings, in
the sense that they find features that correlate with them. Correctness
likely correlates with quality, and the number of references, which is a
feature we included, likely correlates with correctness.
The distribution of the tags is skewed towards Start articles. If you train
a classifier on an un-normalized dataset, it will do the intelligent thing:
classify all articles as Start. Click the "Random page" link a couple of
dozen times and you can see that this is indeed a good way to get roughly
70% of the classifications correct. However, we removed the skew from our
dataset by using equal numbers of articles from every class. The number was
set by the A class, since it has the fewest articles in the encyclopedia.
Thus, we trained on 650 articles of each class, and from this extremely
limited dataset achieved decent performance.
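
The balancing step described above can be sketched as follows. This is a
minimal illustration, not the code used for the paper: the function names,
the toy label counts, and the random seed are all assumptions made for the
example. It shows why always guessing the majority class looks good on
skewed data, and how downsampling every class to the size of the rarest one
removes that advantage.

```python
import random
from collections import Counter, defaultdict


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def balance_by_downsampling(examples, labels, seed=0):
    """Keep an equal number of examples per class, set by the rarest class."""
    by_class = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append(example)

    # The rarest class decides how many examples every class keeps.
    n_per_class = min(len(items) for items in by_class.values())

    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    balanced = []
    for label in sorted(by_class):
        for example in rng.sample(by_class[label], n_per_class):
            balanced.append((example, label))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy counts, skewed towards Start; not real Wikipedia data.
labels = ["Start"] * 70 + ["B"] * 15 + ["Good"] * 8 + ["Featured"] * 4 + ["A"] * 3
examples = [f"article_{i}" for i in range(len(labels))]

skewed_baseline = majority_baseline_accuracy(labels)
balanced = balance_by_downsampling(examples, labels)
balanced_baseline = majority_baseline_accuracy([label for _, label in balanced])

print(skewed_baseline, balanced_baseline)
```

On the balanced set every class has the same number of examples, so the
majority-class guess drops to chance level (one over the number of classes)
and any accuracy above that has to come from the features.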
Of course, this was only a class project, intended to be a proof of concept.
Support Vector Machine classification is well known to perform very
strongly in the domain of text classification, and if we were only
interested in high numbers, we could probably have boosted them that way.
Cheers,
Brian
On Nov 8, 2007 5:27 PM, P. Birken <pbirken(a)gmail.com> wrote:
Erik should be able to help you. I read your paper and your conclusions,
and you might think about rewriting them. In particular, correctness is
not and cannot be evaluated by your method, which therefore cannot point
readers to articles that are most likely correct, only to articles that
are well written. Your measure of the accuracy of your method is also a
bit dubious, since the tags are not uniform (take two featured articles of
different age and they will be of very different quality), so recovering
them to 100% is not a reasonable goal. However, I believe that your method
is a reasonable way to find articles that are badly written.
Bye,
Philipp
2007/11/8, Brian <Brian.Mingus(a)colorado.edu>:
Several collaborators and I are preparing to expand on previous work to
automatically ascertain the quality of Wikipedia articles on the English
Wikipedia (presented at Wikimania '07 [0]). PageRank is Google's hallmark
quality metric, and the foundation actually has access to these numbers
through the Google Webmaster Tools website. If a foundation representative
were to create a Google account and verify that they were a "webmaster,"
they could download the PageRank for every article on the English
Wikipedia in a convenient tabular format. This data would likely serve as
a fantastic predictor. I would also like to compare the Google-computed
PageRank to the PageRank computed via Wikipedia's internal link structure.
I don't see any privacy implications in releasing this data. It also
doesn't seem to help spammers much, as they already know the pages that
have a very high PageRank, and we include rel="nofollow" on outbound
links. Nonetheless, I would of course be willing to keep the data private.
This would only take a few minutes if it were approved. Is there anyone
out there who has the power to make it happen?
Cheers :)
Brian
[0] http://upload.wikimedia.org/wikipedia/wikimania2007/d/d3/RassbachPincockMin…
_______________________________________________
Wikiquality-l mailing list
Wikiquality-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikiquality-l