Ray Saintonge wrote:
Neil Harris wrote:
Phroziac wrote:
No way would we fit in the 30 volumes of Britannica for this hypothetical print release! Anyway, what if we had a feature in the Wikipedia 1.0 idea where we could rate how useful the inclusion of an article in a print version would be? This would allow anyone making a print version, be it the Foundation or someone else, to trim Wikipedia more easily. Certainly you could do it by hand, but eek, that's huge. With our current database dumps, it would already not be unreasonable to make a script to automatically remove articles with stub tags in them. Obviously these would be worthless in a print version.
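Something like the following would be a starting point (untested; the dump file name and the stub-matching regex are just placeholders, and real stub detection would want the actual list of stub templates):

# Minimal sketch: scan a pages-articles XML dump and keep only pages whose
# wikitext does not contain a {{...stub}} template.
import re
import xml.etree.ElementTree as ET

STUB_RE = re.compile(r"\{\{[^{}|]*stub[^{}]*\}\}", re.IGNORECASE)

def non_stub_pages(dump_path):
    """Yield (title, wikitext) for pages that carry no stub template."""
    title, text = None, None
    for event, elem in ET.iterparse(dump_path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop the XML namespace, if any
        if tag == "title":
            title = elem.text or ""
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            if text is not None and not STUB_RE.search(text):
                yield title, text
            title, text = None, None
            elem.clear()                    # keep memory use bounded

if __name__ == "__main__":
    kept = sum(1 for _ in non_stub_pages("pages-articles.xml"))
    print(kept, "non-stub pages")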
In my opinion, an article ranking system would be an ideal way to start collecting data for trying to place articles in rank order for inclusion in a fixed amount of space.
One interesting possibility is, in addition to user rankings, using the number of times the article's title is mentioned on the web -- the Google test -- as an extra input to any hypothetical ranking system.
The thing to remember if a ranking system is used is that it is a tool rather than a solution. It can point to problem articles that need work. We don't need to be limited to a single algorithm for evaluating an article. The Google test can be added, but so can others too.
Ec
That's right. The _gold standard_ for article assessment is peer review; the next best is manual ranking by a sufficiently large and well-distributed group of users; after that come carefully chosen algorithms which blend together machine-generated and human-generated statistics.
Given that we have 750,000+ articles in the English-language Wikipedia alone, it is likely to take some time for a reasonable number of votes to accumulate for all articles. According to my earlier calculation, if we wanted to trim en: Wikipedia into 32 volumes, we would need to leave out five out of six articles. (We could keep wikilinks in: thin underlines, with page/volume references in the margin, for articles that are in the print version, and, say, dotted underlines for those which exist online but not in print, to let people know there is an online article on that topic.)
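For a rough sense of scale (this division is mine, not the earlier calculation referred to above):

    750,000 / 6  ≈ 125,000 articles kept
    125,000 / 32 ≈ 3,900 articles per volume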
This raises the possibility of using machine-generated statistics as a proxy for manual review where it is not yet available. Given a sufficient number of human-rated articles, and a sufficient number of machine-generated statistics for articles, we could use machine learning (a.k.a. function approximation) algorithms to attempt to predict the scores of as-yet-unranked articles. This could then be used as a "force multiplier" for human-based ranking, to rank articles which have not yet received enough human rankings to be statistically significant.
This approach could easily be sanity-checked by taking one random sample of articles as a training set, and another disjoint random sample as a testing set: the predictive power of a machine-learning algorithm trained on the training set could be determined by measuring the quality of its predictions of the true user rankings of the testing set. As the number of articles with statistically significant human rankings increases, the algorithm can be re-trained repeatedly; this would also help resist attempts to "game" the ranking algorithm.
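To show what the train/test check might look like in practice, here is a rough sketch, using an SVM regressor as suggested further below. The scikit-learn and scipy libraries are my choices for illustration, and the two input files (per-article metric rows and mean human ratings) are assumed to have been computed already:

# Illustrative only: predict mean human ratings from article metrics and
# check the predictions against a held-out set of rated articles.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
from scipy.stats import spearmanr

# X: one row of machine-generated metrics per human-rated article
# y: the corresponding mean human rating for each of those articles
X = np.load("article_metrics.npy")   # assumed to exist, shape (n, k)
y = np.load("mean_ratings.npy")      # assumed to exist, shape (n,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)
model = SVR(kernel="rbf")
model.fit(scaler.transform(X_train), y_train)

pred = model.predict(scaler.transform(X_test))
rho, _ = spearmanr(y_test, pred)
print("mean absolute error:", mean_absolute_error(y_test, pred))
print("rank correlation (Spearman):", rho)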
What statistics could be used as input to this kind of approach? It's not hard to think of possible measures:
0. any available user rankings, by value and number of rankings
   0a. stub notices
   0b. "disputed", "cleanup", or "merge" notices now...
   0c. ...or in the past
   0d. has it survived an AfD process? by what margin?
   0e. what fraction of edits are reverts?
   0f. is it a featured article now?
   0g. has it ever been a featured article?
   0h. log10(page view rate via logwood)
   ...and so on...
1. log10(total Google hits for exact-phrase searches for the title and its redirects)
   1a. same as above, but limited to .gov or .edu sites
   1b. same as above, but using matches _within en: Wikipedia itself_
   1c. same as above, but using _the non-en: Wikipedias_
   1d. same as above, but using matches _within the 1911 EB_
   1e. same as above, but using matches _within the Brown corpus_
   1f. ditto, but within the _NIMA placename databases_
   1g, h. _Brewer's Dictionary of Phrase and Fable_, _Gray's Anatomy_
   1i, j, k, l... the Bible, the Qur'an, the Torah, the Rig Veda...
   1m, n... the collected works of Dickens, Shakespeare...
   ...and so on, for various other corpora...
2. log10(number of distinct editors for the article)
3. log10(total number of edits for the article, conflating sequential edits by the same user)
4. log10(age of the article)
5. size of the article text in words
6. size of the article source in bytes
7. approximate "reading age/ease" of the article, using...
   7a. Flesch-Kincaid Grade Level
   7b. Gunning-Fog Index
   7c. SMOG Index
8. number of inter-language links from/to the article
9. inward wikilinks, including via redirects, perhaps weighted by the referring articles' scores (although we should be careful not to infringe the Google patents)
10. # of outward wikilinks
11. # of categories assigned
12. # of redirects created for the article
13. Bayesian scoring for individual words, using the "bag of words" model
    13a. as above, but using assigned categories as tokens
    13b. as above, but for words in the article title
    13c. as above, but for words in edit comments
    13d. as above, but for text on talk pages
    13e. as above, but for names of templates
    13f. as above, but for _usernames of contributing authors_, ho ho ho!
14. shortest category distance from the "fundamental" category
15. shortest wikilink distance from various "seed" pages
16. length of the article title, in characters (shorter is "more fundamental"?)
17. length of the article title, in words
18. what fraction of the article text is written in other scripts, and which ones?
19. does it contain images? how many?
    19a. what is the images-to-words ratio?
20. what is the average paragraph length?
21. how many subheadings does it have?
22. how many external links does it have in its "external links" section?
23. how many inline links does it contain in the main article body?
24. how many "see also"s does it have?
25. what is the ratio of list items to overall words?
26. what is the ratio of proper nouns (crudely measured) to overall words?
...and so on, and so forth. Some of these are easy to calculate, some are hard; a rough sketch of a few of the cheap ones follows below. Can anyone think of better ones?
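To make the flavour of the simpler metrics concrete, here is the sketch just mentioned (my own illustration, with crude regexes standing in for real wikitext parsing; the wikilink count, for instance, also catches category and image links):

# Illustrative only: a handful of the cheap, purely textual metrics from
# the list above, computed with crude regexes rather than a real parser.
import math
import re

def simple_metrics(title, wikitext):
    words = len(wikitext.split())
    return {
        "title_chars": len(title),
        "title_words": len(title.split()),
        "text_words": words,
        "source_bytes": len(wikitext.encode("utf-8")),
        "wikilinks_out": len(re.findall(r"\[\[[^\]]+\]\]", wikitext)),
        "external_links": len(re.findall(r"\[https?://[^\]\s]+", wikitext)),
        "categories": len(re.findall(r"\[\[Category:[^\]]+\]\]", wikitext, re.I)),
        "images": len(re.findall(r"\[\[(?:Image|File):[^\]|]+", wikitext, re.I)),
        "subheadings": len(re.findall(r"^==+.+==+\s*$", wikitext, re.M)),
        "log10_words": math.log10(words) if words else 0.0,
    }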
Individually, I doubt whether any of these are a really good predictor of article quality. However, learning algorithms are surprisingly good at pattern recognition from very noisy multi-dimensional data. It's quite possible that this approach would work with only a limited number of reasonably statistically independent input metrics (ten?); the huge list above is only to give an idea of the large number of possible choices of article metrics, ranging from the simple to the complex.
The corpus-based measures are particularly interesting; they mean we don't need to bug Google for a million search keys.
The machine learning algorithm of choice is probably a support vector machine: SVMs are powerful, simple to use, and capable of learning highly non-linear functions (for example, recognising handwritten Han characters from preprocessed bitmaps), and numerous pre-packaged GPL'd implementations are available as tools.
No doubt there will be lots of academics who might be willing to assign this as a project or PhD topic to one of their research students. ;-)
Before any of this could be possible, the article ranking system would in any case need to be up and running for some time; we need that anyway for the manual approach.
-- Neil
It seems a bit -- forgive me -- daft to use the number of Google hits for non-English WPs... how could that possibly be relevant?
Mark
Mark Williamson wrote:
It seems a bit -- forgive me -- daft to use the number of Google hits for non-English WPs... how could that possibly be relevant?
Mark
If someone is considered important enough to have their/its proper name mentioned in, say, the German or French WPs, even in passing, they probably have some significance that transcends national boundaries.
For example, try
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Albert Einstein" -- 46,100 hits
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Kofi Annan" -- 9,240 hits
but
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Monty Python" -- 608 hits
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Britney Spears" -- 519 hits
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Harold Wilson" -- 335 hits
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Samantha Fox" -- 86 hits
and
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Robert Kilroy-Silk" -- 12 hits
site:wikipedia.org -site:en.wikipedia.org -site:www.wikipedia.org "Carrot Top" -- 4 hits
and, of course, it works the other way:
site:wikipedia.org -site:de.wikipedia.org -site:www.wikipedia.org "Bernd das Brot" -- 9 hits

...hooray!
However, results from WPs with non-Latin scripts might be less useful, as they are more likely to transliterate names.
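For anyone wanting to script this, the query pattern above is just string assembly; a throwaway sketch (the function name and the home-wiki parameter are mine):

# Illustrative: build the cross-wiki "Google test" query used above, i.e.
# search wikipedia.org while excluding the home-language wiki and the portal.
def cross_wiki_query(phrase, home="en"):
    return ('site:wikipedia.org -site:%s.wikipedia.org '
            '-site:www.wikipedia.org "%s"' % (home, phrase))

print(cross_wiki_query("Albert Einstein"))            # the en: example above
print(cross_wiki_query("Bernd das Brot", home="de"))  # the de: example above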
-- Neil
Mark Williamson wrote:
It seems a bit -- forgive me -- daft to use the number of Google hits for non-English WPs... how could that possibly be relevant?
That would be right if the number of Google hits were the only metric. If it is only one of several metrics, the resulting meta-analysis will be drawn toward the mean. Another possibility is to assign a normalizing factor to each language to account for the relative frequency of that language.
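Something along these lines, say (the weights here are placeholders, not measured language frequencies):

# Illustrative only: scale per-language hit counts by a rough weight for how
# much text each language contributes, so large wikis don't dominate.
LANGUAGE_WEIGHT = {"de": 1.0, "fr": 0.8, "ja": 0.7}   # placeholder values

def normalized_hits(hits_by_language):
    return sum(hits / LANGUAGE_WEIGHT.get(lang, 0.5)
               for lang, hits in hits_by_language.items())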
Ec
Neil Harris wrote:
The corpus-based measures are particularly interesting; they mean we don't need to bug Google for a million search keys.
Although if anyone from Google is monitoring this list, and wants to give me a Google Account with 1.25M search keys, I'd be happy to set off the appropriate script... or send it to you to run.
-- Neil
On 10/5/05, Neil Harris usenet@tonal.clara.co.uk wrote:
Neil Harris wrote:
The corpus-based measures are particularly interesting; they mean we don't need to bug Google for a million search keys.
Although if anyone from Google is monitoring this list, and wants to give me a Google Account with 1.25M search keys, I'd be happy to set off the appropriate script... or send it to you to run.
In any case, the number of results Google reports is only a rough estimate. See, for example, http://blog.outer-court.com/archive/2005-02-08.html#n72
I think it'd be much easier to use a standard measure of usefulness: look at access logs on Wikipedia's end. If article A gets twice the number of hits per day as article B, it would seem natural that someone would be twice as likely to look it up in a paper-based encyclopedia. (There are certainly exceptions, like hot news stories or controversial topics during a revert war, but I think it'd take you a long way...)
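A minimal sketch of that, assuming one request per line with the article title in the second whitespace-separated field (that format is an assumption for illustration, not the actual log layout):

# Illustrative only: rank articles by page views counted from an access log.
from collections import Counter

def top_articles(log_path, n=20):
    views = Counter()
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) >= 2:
                views[fields[1]] += 1
    return views.most_common(n)

for title, hits in top_articles("access.log"):
    print(hits, title)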
I like Neil's list too, but that, as they observed, is a lot more work.
-- Evan, monitoring this list :)