With regard to quality assessment features, I recommend reading through
our paper from WikiSym this year:
The related work section contains quite a lot of the previous research on
predicting article quality, so there should be plenty of useful reading.
As James points out, content and number of footnote references are a good
There are a lot of dependencies when it comes to predicting article
quality. If you're trying to predict high quality vs. everything else, the
task isn't overly difficult. Otherwise it can be more challenging. For
instance, there is quite a bit of difference between the FAs and GAs on
English Wikipedia, and in your case you'll probably find that the A-class
articles mess things up, because their length tends to be somewhere between
the other two and they're of high quality. I'm currently of the opinion
that an A-class article is simply an FAC that hasn't been submitted for FA
You might of course run into problems with different citation traditions if
you're working across language editions. English uses footnotes heavily,
others might instead use bibliography sections and not really cite specific
claims in the article text. (We mention this issue in our article, where we
tried to get our model to work on Norwegian (bokmål) and Swedish Wikipedia.)
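
To make the footnote point concrete, here is a rough sketch of how one might count footnote references in raw wikitext. It is not the feature extraction from our paper; the function name citation_features and the regular expressions are assumptions made up for illustration, and the counts are crude (a reused named reference is counted each time it appears). An article that relies on a bibliography section instead of footnotes would score near zero on footnote_refs, which is exactly the cross-language problem described above.

```python
import re

# Hypothetical helper for illustration only; not the paper's feature code.
# Matches <ref>, <ref name="x"> and <ref name="x" />, but not <references />.
REF_TAG = re.compile(r"<ref[\s>/]", re.IGNORECASE)
# Matches section headings such as "== History ==" on their own line.
HEADING = re.compile(r"^==+[^=].*?==+\s*$", re.MULTILINE)


def citation_features(wikitext):
    """Return rough length and citation counts for one article's wikitext."""
    n_chars = len(wikitext)
    n_refs = len(REF_TAG.findall(wikitext))
    n_headings = len(HEADING.findall(wikitext))
    return {
        "chars": n_chars,
        "footnote_refs": n_refs,
        "headings": n_headings,
        # Footnotes per 1,000 characters, so long and short articles compare.
        "refs_per_kchar": 1000.0 * n_refs / n_chars if n_chars else 0.0,
    }


# Made-up sample text, not a real article.
sample = (
    "'''Example''' is a town.<ref>Source A</ref> It has a market."
    "<ref name=\"b\">Source B</ref> More text.<ref name=\"b\" />\n"
    "== History ==\nFounded long ago.\n"
    "== Bibliography ==\n* Some Book\n"
)
features = citation_features(sample)
print(features)
```

Dividing by article length is one possible choice here; raw counts may work just as well, depending on the model.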
My $.02. If you'd like to discuss this more, feel free to get in touch.
On 15 December 2013 07:15, Klein,Max <kleinm(a)oclc.org> wrote:
Wiki Research Junkies,
I am investigating the comparative quality of articles about Cote
d'Ivoire and Uganda versus other countries. I want to answer the question
of what makes a high-quality article. Can anyone point me to any existing
research on heuristics of Article Quality? That is, determining an article's
quality from its wikitext properties, without human rating? I would also
consider using data from the Article Feedback Tools, if there were dumps
available for each Article in English, French, and Swahili Wikipedias.
This is all the raw data I can seem to find
The heuristic technique that I am currently using is training a naive
Bayesian filter based on:
Text length in each section
Infoboxes in each section
Filled parameters in each infobox
Images in each section
Good Article, Featured Article?
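
As a minimal sketch of the approach listed above, the following trains a naive Bayes classifier on per-article counts. Several things here are assumptions rather than the method described in this message: the Gaussian variant of naive Bayes, the function names fit and predict, the per-article (not per-section) feature layout, and all of the numbers, which are made up. The Good Article / Featured Article status is treated as the label.

```python
import math
from collections import defaultdict


def fit(rows, labels):
    """Estimate a log prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance from collapsing to zero.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-6
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(rows)), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one article."""
    def score(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(row, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=score)


# Made-up rows: [text length, infoboxes, filled infobox parameters, images]
rows = [
    [52000, 1, 18, 12], [61000, 2, 25, 15], [48000, 1, 20, 9],
    [3000, 0, 0, 1], [5200, 1, 4, 0], [1800, 0, 0, 0],
]
labels = ["high", "high", "high", "other", "other", "other"]

model = fit(rows, labels)
guess = predict(model, [45000, 1, 15, 8])
print(guess)
```

With real data, the rows would come from parsed wikitext and the labels from the existing assessment classes. Raw text length dominates the other features at this scale, so a log transform may be worth trying.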
Then Normalize on Page Views per on population / speakers of native
Can you also think of any other dimensions or heuristics to
Wikipedian in Residence, OCLC
Wiki-research-l mailing list