Hey folks,
I just finished working with Amir[1,2] and building off of some of Morten's
work[3] to put together something that I think you're going to like.
Halfaker, Aaron (2016): Monthly Wikipedia article quality predictions.
figshare.
https://dx.doi.org/10.6084/m9.figshare.3859800
Retrieved: 00:56, Oct 12, 2016 (GMT)
This dataset contains a row for every article-month since 2001-01-01. Each
row has an article quality prediction from a text-only machine classifier
(based on [3], with slight improvements) that is hosted by ORES[4]. We've
managed to build models for English, French, and Russian Wikipedia, so I've
generated a dataset for each of those wikis. They're current as of
2016-08-01 and I plan to run updates periodically.
Here are the columns:
- page_id -- The page identifier
- page_title -- The title of the article (UTF-8, with underscores)
- rev_id -- The most recent revision ID at the time of assessment
- timestamp -- The timestamp when the assessment was taken
  (YYYYMMDDHHMMSS)
- prediction -- The predicted quality class ("Stub", "Start", "C", "B",
  "GA", "FA", ...)
- weighted_sum -- The sum of prediction weights assuming indexed class
  ordering ("Stub" = 0, "Start" = 1, ...)
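To make weighted_sum a bit more concrete, here is a rough sketch of how such
a score could be computed. It assumes the "prediction weights" are the
per-class probabilities from the classifier and uses the English Wikipedia
class order listed above; the function name and the example probabilities
are made up for illustration and are not taken from the dataset or from ORES.

```python
# Class ordering as listed above ("Stub" = 0, "Start" = 1, ...).
# Assumption: this is the English Wikipedia scale; other wikis differ.
CLASSES = ["Stub", "Start", "C", "B", "GA", "FA"]


def weighted_sum(probabilities):
    """Sum each class index times its predicted probability.

    `probabilities` maps a class name to its weight. Classes that are
    missing from the mapping count as 0.
    """
    return sum(
        index * probabilities.get(name, 0.0)
        for index, name in enumerate(CLASSES)
    )


# Hypothetical probabilities for one article, not real model output.
example = {"Stub": 0.1, "Start": 0.2, "C": 0.4, "B": 0.2, "GA": 0.05, "FA": 0.05}
score = weighted_sum(example)
print(round(score, 2))
```

Under that reading, an article predicted as "Stub" with full confidence
scores 0 and one predicted as "FA" with full confidence scores 5, so the
column works as a continuous quality measure that can be tracked month to
month.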
I'll update the docs based on your questions :)
1. https://phabricator.wikimedia.org/p/Ladsgroup/
2. https://github.com/Ladsgroup
3. http://www-users.cs.umn.edu/~morten/publications/wikisym2013-tellmemore.pdf
4. https://ores.wikimedia.org/
-Aaron