Ray Saintonge wrote:
Neil Harris wrote:
Phroziac wrote:
No way would we fit in the 30 volumes of Britannica for this hypothetical print release! Anyway, what if we had a feature in the Wikipedia 1.0 idea where we could rate how useful the inclusion of an article in a print version would be? This would allow anyone making a print version, be it the Foundation or someone else, to trim Wikipedia more easily. Certainly you could do it by hand, but eek, that's huge. With our current database dumps, it would already not be unreasonable to make a script to automatically remove articles with stub tags in them. Obviously these would be worthless in a print version.
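Something like the following would be a starting point (untested; the dump file name and the stub-matching regex are just placeholders, and real stub detection would want the actual list of stub templates):

# Minimal sketch: scan a pages-articles XML dump and keep only pages whose
# wikitext does not contain a {{...stub}} template.
import re
import xml.etree.ElementTree as ET

STUB_RE = re.compile(r"\{\{[^{}|]*stub[^{}]*\}\}", re.IGNORECASE)

def non_stub_pages(dump_path):
    """Yield (title, wikitext) for pages that carry no stub template."""
    title, text = None, None
    for event, elem in ET.iterparse(dump_path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop the XML namespace, if any
        if tag == "title":
            title = elem.text or ""
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            if text is not None and not STUB_RE.search(text):
                yield title, text
            title, text = None, None
            elem.clear()                    # keep memory use bounded

if __name__ == "__main__":
    kept = sum(1 for _ in non_stub_pages("pages-articles.xml"))
    print(kept, "non-stub pages")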
In my opinion, an article ranking system would be an ideal way to start collecting data for trying to place articles in rank order for inclusion in a fixed amount of space.
One interesting possibility is, in addition to user rankings, using the number of times the article's title is mentioned on the web -- the Google test -- as an extra input to any hypothetical ranking system.
The thing to remember if a ranking system is used is that it is a tool rather than a solution. It can point to problem articles that need work. We don't need to be limited to a single algorithm for evaluating an article. The Google test can be added, but so can others too.
Ec
That's right. The _gold standard_ for article assessment is peer review; the next best is manual ranking by a sufficiently large and well-distributed group of users; after that come carefully chosen algorithms which blend together machine-generated and human-generated statistics.
Given that we have 750,000+ articles in the English-language Wikipedia alone, it is likely to take some time for a reasonable number of votes to accumulate for all articles. According to my earlier calculation, if we wanted to trim en: Wikipedia into 32 volumes, we would need to leave out five out of six articles. (We could keep wikilinks in: thin underlines, with page/volume references in the margin, for articles that are in the print version, and, say, dotted underlines for those which exist online but not in print, to let people know there is an online article on that topic.)
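For a rough sense of scale (this division is mine, not the earlier calculation referred to above):

    750,000 / 6  ≈ 125,000 articles kept
    125,000 / 32 ≈ 3,900 articles per volume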
This raises the possibility of using machine-generated statistics as a proxy for manual review where it is not yet available. Given a sufficient number of human-rated articles, and a sufficient number of machine-generated statistics for articles, we could use machine learning (a.k.a. function approximation) algorithms to attempt to predict the scores of as-yet-unranked articles. This could then be used as a "force multiplier" for human-based ranking, to rank articles which have not yet received enough human rankings to be statistically significant.
This approach could easily be sanity-checked by taking one random sample of articles as a training set, and another disjoint random sample as a testing set: the predictive power of a machine-learning algorithm trained on the training set could be determined by measuring the quality of its predictions of the true user rankings of the testing set. As the number of articles with statistically significant human rankings increases, the algorithm can be re-trained repeatedly; this would also help resist attempts to "game" the ranking algorithm.
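To show what the train/test check might look like in practice, here is a rough sketch, using an SVM regressor as suggested further below. The scikit-learn and scipy libraries are my choices for illustration, and the two input files (per-article metric rows and mean human ratings) are assumed to have been computed already:

# Illustrative only: predict mean human ratings from article metrics and
# check the predictions against a held-out set of rated articles.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
from scipy.stats import spearmanr

# X: one row of machine-generated metrics per human-rated article
# y: the corresponding mean human rating for each of those articles
X = np.load("article_metrics.npy")   # assumed to exist, shape (n, k)
y = np.load("mean_ratings.npy")      # assumed to exist, shape (n,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)
model = SVR(kernel="rbf")
model.fit(scaler.transform(X_train), y_train)

pred = model.predict(scaler.transform(X_test))
rho, _ = spearmanr(y_test, pred)
print("mean absolute error:", mean_absolute_error(y_test, pred))
print("rank correlation (Spearman):", rho)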
What statistics could be used as input to this kind of approach? It's not hard to think of possible measures:
0. any available user rankings, by value and number of rankings
   0a. stub notices
   0b. "disputed", "cleanup", or "merge" notices now...
   0c. ...or in the past
   0d. has it survived an AfD process? by what margin?
   0e. what fraction of edits are reverts?
   0f. is it a featured article now?
   0g. has it ever been a featured article?
   0h. log10(page view rate via logwood)
   ...and so on...
1. log10(total Google hits for exact-phrase searches for the title and its redirects)
   1a. same as above, but limited to .gov or .edu sites
   1b. same as above, but using matches _within en: Wikipedia itself_
   1c. same as above, but using _the non-en: Wikipedias_
   1d. same as above, but using matches _within the 1911 EB_
   1e. same as above, but using matches _within the Brown corpus_
   1f. ditto, but within the _NIMA placename databases_
   1g, h. _Brewer's Dictionary of Phrase and Fable_, _Gray's Anatomy_
   1i, j, k, l... the Bible, the Qur'an, the Torah, the Rig Veda...
   1m, n... the collected works of Dickens, Shakespeare...
   ...and so on, for various other corpora...
2. log10(number of distinct editors for the article)
3. log10(total number of edits for the article, conflating sequential edits by the same user)
4. log10(age of the article)
5. size of the article text in words
6. size of the article source in bytes
7. approximate "reading age/ease" of the article, using...
   7a. Flesch-Kincaid Grade Level
   7b. Gunning-Fog Index
   7c. SMOG Index
8. number of inter-language links from/to the article
9. inward wikilinks, including via redirects, perhaps weighted by the referring articles' scores (although we should be careful not to infringe the Google patents)
10. # of outward wikilinks
11. # of categories assigned
12. # of redirects created for the article
13. Bayesian scoring for individual words, using the "bag of words" model
    13a. as above, but using assigned categories as tokens
    13b. as above, but for words in the article title
    13c. as above, but for words in edit comments
    13d. as above, but for text on talk pages
    13e. as above, but for names of templates
    13f. as above, but for _usernames of contributing authors_, ho ho ho!
14. shortest category distance from the "fundamental" category
15. shortest wikilink distance from various "seed" pages
16. length of the article title, in characters (shorter is "more fundamental"?)
17. length of the article title, in words
18. what fraction of the article text is written in other scripts, and which ones?
19. does it contain images? how many?
    19a. what is the images-to-words ratio?
20. what is the average paragraph length?
21. how many subheadings does it have?
22. how many external links does it have in its "external links" section?
23. how many inline links does it contain in the main article body?
24. how many "see also"s does it have?
25. what is the ratio of list items to overall words?
26. what is the ratio of proper nouns (crudely measured) to overall words?
...and so on, and so forth. Some of these are easy to calculate, some are hard; a rough sketch of a few of the cheap ones follows below. Can anyone think of better ones?
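To make the flavour of the simpler metrics concrete, here is the sketch just mentioned (my own illustration, with crude regexes standing in for real wikitext parsing; the wikilink count, for instance, also catches category and image links):

# Illustrative only: a handful of the cheap, purely textual metrics from
# the list above, computed with crude regexes rather than a real parser.
import math
import re

def simple_metrics(title, wikitext):
    words = len(wikitext.split())
    return {
        "title_chars": len(title),
        "title_words": len(title.split()),
        "text_words": words,
        "source_bytes": len(wikitext.encode("utf-8")),
        "wikilinks_out": len(re.findall(r"\[\[[^\]]+\]\]", wikitext)),
        "external_links": len(re.findall(r"\[https?://[^\]\s]+", wikitext)),
        "categories": len(re.findall(r"\[\[Category:[^\]]+\]\]", wikitext, re.I)),
        "images": len(re.findall(r"\[\[(?:Image|File):[^\]|]+", wikitext, re.I)),
        "subheadings": len(re.findall(r"^==+.+==+\s*$", wikitext, re.M)),
        "log10_words": math.log10(words) if words else 0.0,
    }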
Individually, I doubt whether any of these are a really good predictor of article quality. However, learning algorithms are surprisingly good at pattern recognition from very noisy multi-dimensional data. It's quite possible that this approach would work with only a limited number of reasonably statistically independent input metrics (ten?); the huge list above is only to give an idea of the large number of possible choices of article metrics, ranging from the simple to the complex.
The corpus-based measures are particularly interesting; they mean we don't need to bug Google for a million search keys.
The machine learning algorithm of choice is probably a support vector machine: SVMs are powerful, simple to use, and capable of learning highly non-linear functions (for example, recognising handwritten Han characters from preprocessed bitmaps), and numerous pre-packaged GPL'd implementations are available as tools.
No doubt there will be lots of academics who might be willing to assign this as a project or PhD topic to one of their research students. ;-)
Before any of this could be possible, the article ranking system would in any case need to be up and running for some time; we need that anyway for the manual approach.
-- Neil
It seems a bit -- forgive me -- daft to use the number of Google hits for non-English WPs... how could that possibly be relevant?
Mark
Mark Williamson wrote:
It seems a bit -- forgive me -- daft to use the number of Google hits for non-English WPs... how could that possibly be relevant?
Mark
If someone is considered important enough to have their/its proper name mentioned in, say, the German or French WPs, even in passing, they probably have some significance that transcends national boundaries.
For example, try
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Albert Einstein" -- 46,100 hits
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Kofi Annan" -- 9,240 hits
but
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Monty Python" -- 608 hits
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Britney Spears" -- 519 hits
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Harold Wilson" -- 335 hits
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Samantha Fox" -- 86 hits
and
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Robert Kilroy-Silk" -- 12 hits
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Carrot Top" -- 4 hits
and, of course, it works the other way:
site:wikipedia.org -site:de.wikipedia.org -site:www.wikipedia.org "Bernd das Brot" -- 9 hits

...hooray!
However, results from WPs with non-Latin scripts might be less useful, as they are more likely to transliterate names.
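For anyone wanting to script this, the query pattern above is just string assembly; a throwaway sketch (the function name and the home-wiki parameter are mine):

# Illustrative: build the cross-wiki "Google test" query used above, i.e.
# search wikipedia.org while excluding the home-language wiki and the portal.
def cross_wiki_query(phrase, home="en"):
    return ('site:wikipedia.org -site:%s.wikipedia.org '
            '-site:www.wikipedia.org "%s"' % (home, phrase))

print(cross_wiki_query("Albert Einstein"))            # the en: example above
print(cross_wiki_query("Bernd das Brot", home="de"))  # the de: example above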
-- Neil
Mark Williamson wrote:
It seems a bit -- forgive me -- daft to use the number of Google hits for non-English WPs... how could that possibly be relevant?
That would be right if the number of Google hits were the only metric. If it is only one of several metrics, the resulting meta-analysis will be drawn toward the mean. Another possibility is to assign a normalizing factor to each language to account for the relative frequency of that language.
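Something along these lines, say (the weights here are placeholders, not measured language frequencies):

# Illustrative only: scale per-language hit counts by a rough weight for how
# much text each language contributes, so large wikis don't dominate.
LANGUAGE_WEIGHT = {"de": 1.0, "fr": 0.8, "ja": 0.7}   # placeholder values

def normalized_hits(hits_by_language):
    return sum(hits / LANGUAGE_WEIGHT.get(lang, 0.5)
               for lang, hits in hits_by_language.items())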
Ec
Neil Harris wrote:
The corpus-based measures are particularly interesting; they mean we don't need to bug Google for a million search keys.
Although if anyone from Google is monitoring this list, and wants to give me a Google Account with 1.25M search keys, I'd be happy to set off the appropriate script... or send it to you to run.
-- Neil
On 10/5/05, Neil Harris usenet@tonal.clara.co.uk wrote:
Neil Harris wrote:
The corpus-based measures are particularly interesting; they mean we don't need to bug Google for a million search keys.
Although if anyone from Google is monitoring this list, and wants to give me a Google Account with 1.25M search keys, I'd be happy to set off the appropriate script... or send it to you to run.
In any case, the number of results Google reports is only a rough estimate. See, for example, http://blog.outer-court.com/archive/2005-02-08.html#n72
I think it'd be much easier to use a standard measure of usefulness: look at access logs on Wikipedia's end. If article A gets twice the number of hits per day as article B, it would seem natural that someone would be twice as likely to look it up in a paper-based encyclopedia. (There are certainly exceptions, like hot news stories or controversial topics during a revert war, but I think it'd take you a long way...)
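A minimal sketch of that, assuming one request per line with the article title in the second whitespace-separated field (that format is an assumption for illustration, not the actual log layout):

# Illustrative only: rank articles by page views counted from an access log.
from collections import Counter

def top_articles(log_path, n=20):
    views = Counter()
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) >= 2:
                views[fields[1]] += 1
    return views.most_common(n)

for title, hits in top_articles("access.log"):
    print(hits, title)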
I like Neil's list too, but that, as they observed, is a lot more work.
-- Evan, monitoring this list :)