Max,

With regards to quality assessment features, I recommend reading through our paper from WikiSym this year: http://www-users.cs.umn.edu/~morten/publications/wikisym2013-tellmemore.pdf

The related work section surveys much of the previous research on predicting article quality, so there should be plenty of useful reading there.  As James points out, content and the number of footnote references are a good start.
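As a quick sketch of that kind of feature extraction: counting footnote references in raw wikitext can be done by matching <ref> tags (both paired and self-closing). This is my own minimal, hedged example, not the approach from our paper; the sample text is invented.

```python
import re

# Count footnote references in wikitext by matching <ref> tags,
# both paired (<ref>...</ref>) and self-closing (<ref name="x" />).
# A rough heuristic only; it ignores templates like {{sfn}}.
def count_refs(wikitext):
    return len(re.findall(r"<ref[^>/]*(?:/>|>.*?</ref>)", wikitext, flags=re.DOTALL))

sample = (
    "Abidjan is the largest city.<ref>Some source.</ref> "
    "Its population grew.<ref name=\"census\" /> More text."
)
print(count_refs(sample))  # 2
```

Cross-language caveats apply here too: on wikis that cite via bibliography sections rather than inline footnotes, a count like this will undercount how well-sourced the article actually is.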

There are a lot of dependencies when it comes to predicting article quality.  If you're trying to predict high quality vs. everything else, the task isn't overly difficult.  Otherwise it can be more challenging: for instance, there is quite a bit of difference between the FAs and GAs on English Wikipedia, and in your case you'll probably find that the A-class articles mess things up, because their length tends to fall between the other two while they're still of high quality.  I'm currently of the opinion that an A-class article is simply an FA candidate that hasn't been submitted for FA review yet.

You might of course run into problems with different citation traditions if you're working across language editions.  English uses footnotes heavily; others might instead use bibliography sections and not cite specific claims in the article text.  (This is an issue we mention in our paper, where we tried to get our model to work on the Norwegian (bokmål) and Swedish Wikipedias.)

My $.02.  If you'd like to discuss this more, feel free to get in touch.


Cheers,
Morten




On 15 December 2013 07:15, Klein,Max <kleinm@oclc.org> wrote:
Wiki Research Junkies,

I am investigating the comparative quality of articles about Cote d'Ivoire and Uganda versus other countries. I wanted to answer the question: what makes a high-quality article? Can anyone point me to any existing research on heuristics of article quality? That is, determining an article's quality from its wikitext properties, without human rating? I would also consider using data from the Article Feedback Tool, if there were dumps available for each article in the English, French, and Swahili Wikipedias. This is all the raw data I can seem to find: http://toolserver.org/~dartar/aft5/dumps/

The heuristic technique that I am currently using is training a naive Bayes classifier based on:
  • Per section:
    • Text length in each section
    • Infoboxes in each section
      • Filled parameters in each infobox
    • Images in each section
  • Good Article / Featured Article status
  • Then normalize on page views per population / speakers of the native language

Can you also think of any other dimensions or heuristics to rate programmatically?
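For concreteness, here is a minimal self-contained sketch of the naive Bayes approach described above, using Gaussian likelihoods over a few per-article features. The feature vectors and numbers are entirely invented for illustration; the real pipeline would extract them per section from wikitext.

```python
import math

# Hypothetical per-article feature vectors: [text length, image count,
# filled infobox parameters]. Label 1 = Good/Featured, 0 = other.
# All values are made up for illustration.
TRAIN = [
    ([12000, 14, 30], 1),
    ([15000, 20, 25], 1),
    ([9000, 10, 28], 1),
    ([1200, 1, 4], 0),
    ([800, 0, 2], 0),
    ([2500, 3, 6], 0),
]

def fit(data):
    """Estimate per-class feature means, variances, and class priors."""
    stats = {}
    for label in {y for _, y in data}:
        rows = [x for x, y in data if y == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # smoothed
            for col, m in zip(zip(*rows), means)
        ]
        stats[label] = (means, variances, n / len(data))
    return stats

def predict(stats, x):
    """Return the class with the highest log posterior under Gaussian NB."""
    best, best_lp = None, None
    for label, (means, variances, prior) in stats.items():
        lp = math.log(prior)
        for v, m, var in zip(x, means, variances):
            lp += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if best_lp is None or lp > best_lp:
            best, best_lp = label, lp
    return best

model = fit(TRAIN)
print(predict(model, [11000, 12, 27]))  # long, image-rich article -> 1
print(predict(model, [1000, 1, 3]))     # short stub-like article -> 0
```

The normalization by page views per speaker population would happen outside the classifier, as a separate adjustment on the features or the comparison across countries.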


Best,

Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l