Well, of course there is a correlation between quality of style and
correctness. But it is not so strong that you could say that a
well-written article is probably correct. You can only deduce that a
badly written article is probably incorrect. This is an important
difference that cannot be stressed often enough.
Best wishes,
Philipp
2007/11/8, Brian <Brian.Mingus(a)colorado.edu>:
Thanks for reading it. Articles that have high quality by the measures used
in that paper tend to score high on all dimensions of quality. These
dimensions are correlated not only with articles being well written, but
also with articles being correct. I'm not sure there is a good argument
against the point that Featured articles will tend to be more correct than
A articles, which will tend to be more correct than Good articles, etc. That
comment was made as more of an aside in the conclusion. These machine
learning algorithms aren't doing much more than searching for correlations.
Thus, they are no better at finding poorly written articles than at finding
correct articles, or anything else you can imagine. They do not discover
causation. They do "reverse engineer" the human ratings, in the sense that
they find features that correlate with them. Correctness likely correlates
with quality, and the number of references, which is a feature we included,
likely correlates with correctness.
The distribution of the tags is skewed towards Start articles. If you train
a classifier on an un-normalized dataset, it will do the intelligent thing:
classify all articles as Start. Click the "Random page" link a couple of
dozen times and you can see that this is indeed a good way to get roughly
70% of the classifications correct. However, we removed the skew from our
dataset by using equal numbers of all classes, based on the number of A
articles, since A is the rarest class in the encyclopedia. Thus, we
trained on 650 articles of each class and, from this extremely limited
dataset, achieved decent performance.
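To make the skew argument concrete, here is a minimal sketch in Python. It
is not the code used for the paper. The function names, the toy class
counts, and the (features, label) layout are all assumptions made for
illustration. It shows the accuracy of always guessing the most common
class, and one simple way to balance a dataset by undersampling every class
down to the size of the rarest one.

```python
import random
from collections import Counter, defaultdict


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def undersample(examples, seed=0):
    """Keep the same number of examples per class, set by the rarest class.

    `examples` is a list of (features, label) pairs (hypothetical layout).
    """
    by_class = defaultdict(list)
    for features, label in examples:
        by_class[label].append((features, label))
    per_class = min(len(items) for items in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_class):
        balanced.extend(rng.sample(by_class[label], per_class))
    rng.shuffle(balanced)
    return balanced


# Toy, made-up class counts (not the real Wikipedia distribution).
toy = (
    [((i,), "Start") for i in range(70)]
    + [((i,), "B") for i in range(20)]
    + [((i,), "A") for i in range(10)]
)

skewed_labels = [label for _, label in toy]
balanced_labels = [label for _, label in undersample(toy)]

print(majority_baseline_accuracy(skewed_labels))    # trivial guess looks good
print(majority_baseline_accuracy(balanced_labels))  # trivial guess is chance
```

On the skewed toy data the trivial "always Start" guess is right 70% of the
time. After balancing it drops to one in three, so any accuracy above that
has to come from the features.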
Of course, this was only a class project, intended to be a proof of concept.
It is well known that Support Vector Machine classification consistently
outperforms other methods in the domain of text classification, and if we
were only interested in high numbers, we could have boosted them that way.
Cheers,
Brian
On Nov 8, 2007 5:27 PM, P. Birken <pbirken(a)gmail.com> wrote:
Erik should be able to help you. I read your paper and your
conclusions, and you might think about rewriting them. In particular,
correctness is not and cannot be evaluated by your method, which
therefore cannot point readers to articles that are most likely
correct, only to articles that are well written. Your measure of the
accuracy of your method is also a bit dubious, since the tags are not
uniform (take two featured articles of different age and they will be
of very different quality), and recovering them 100% is therefore
not a reasonable goal. However, I believe that your method is
a reasonable way to find articles that are badly written.
Bye,
Philipp
2007/11/8, Brian <Brian.Mingus(a)colorado.edu>:
> Several collaborators and I are preparing to expand on previous work to
> automatically ascertain the quality of Wikipedia articles on the English
> Wikipedia (presented at Wikimania '07 [0]). PageRank is Google's hallmark
> quality metric, and the foundation actually has access to these numbers
> through the Google Webmaster Tools website. If a foundation representative
> were to create a Google account and verify that they were a "webmaster,"
> they could download the PageRank for every article on the English Wikipedia
> in a convenient tabular format. This data would likely serve as a fantastic
> predictor. I would also like to compare the Google-computed PageRank to the
> PageRank computed via Wikipedia's internal link structure. I don't see any
> privacy implications in releasing this data. It also doesn't seem to help
> spammers much, as they already know the pages that have a very high
> PageRank, and we include rel="nofollow" on outbound links. Nonetheless, I
> would of course be willing to keep the data private.
>
> This would only take a few minutes if it were approved. Is anyone out there
> who has the power to make it happen?
>
> Cheers :)
> Brian
>
> [0] http://upload.wikimedia.org/wikipedia/wikimania2007/d/d3/RassbachPincockMin…
>
>
> _______________________________________________
> Wikiquality-l mailing list
> Wikiquality-l(a)lists.wikimedia.org
> http://lists.wikimedia.org/mailman/listinfo/wikiquality-l