[WikiEN-l] Hypothetical print version, and using Google to rank articles
Neil Harris
neil at tonal.clara.co.uk
Wed Oct 5 00:58:17 UTC 2005
Phroziac wrote:
> No way would we fit in the 30 volumes of Britannica for this
> hypothetical print release! Anyway, what if we had a feature in the
> Wikipedia 1.0 idea, where we could rate how useful the inclusion of an
> article in a print version would be. This would allow anyone making a
> print version, be it the foundation, or someone else, to trim wikipedia
> easier. Certainly you could do it by hand, but eek. that's huge. With
> our current database dumps, it would already not be unreasonable to make
> a script to automatically remove articles with stub tags in them.
> Obviously these would be worthless in a print version.
>
> What do you all think?
>
Just for comparison, the current edition of the EB has about 44M words
in 32 volumes. As of July 13, the English-language Wikipedia contained
649,000 articles, and a total of roughly 224M words.
Wikipedia currently has over 750,000 articles, so assuming that average
article length has not fallen, it probably has around 258M words. This is
almost six times the size of the EB, and would take at least 187 volumes
of EB-equivalent size.
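For anyone who wants to re-run the arithmetic with newer figures, here is a rough sketch in Python. The constants are just the numbers quoted above; the function name is my own invention, and the whole thing assumes that average article length stays constant.

```python
# Figures quoted in this message; update them as the article count grows.
EB_WORDS = 44_000_000          # words in the current EB
EB_VOLUMES = 32                # volumes in the current EB
WP_WORDS_JULY = 224_000_000    # English Wikipedia words as of July 13
WP_ARTICLES_JULY = 649_000     # English Wikipedia articles as of July 13
WP_ARTICLES_NOW = 750_000      # approximate article count today


def estimate_print_size(articles_now=WP_ARTICLES_NOW):
    """Return (estimated words, ratio to EB, EB-sized volumes needed).

    Assumes the average article is as long now as it was in July.
    """
    words_per_article = WP_WORDS_JULY / WP_ARTICLES_JULY
    words_now = words_per_article * articles_now
    ratio = words_now / EB_WORDS
    volumes = ratio * EB_VOLUMES
    return words_now, ratio, volumes


words_now, ratio, volumes = estimate_print_size()
print(f"{words_now / 1e6:.0f}M words, {ratio:.1f}x EB, {volumes:.0f} volumes")
```

Without intermediate rounding this comes out a shade higher than the figures in the text, but the conclusion is the same: somewhere just under 190 EB-sized volumes.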
In my opinion, an article ranking system would be an ideal way to start
collecting the data needed to place articles in rank order for inclusion
in a fixed amount of space.
One interesting possibility is, in addition to user rankings, using the
number of times the article's title is mentioned on the web -- the
Google test -- as an extra input to any hypothetical ranking system.
For example, using this very crude test:
"America" -- 1,260,000,000
"Papua New Guinea" -- 68,400,000
"gallbladder" -- 2,670,000
"Basement Jaxx" -- 2,320,000
"Hilbert space" -- 1,770,000
"catecholamine" -- 1,200,000
"Xenu" -- 595,000
"Horatio Nelson" -- 403,000
"Toad the Wet Sprocket" -- 354,000 [!]
"lutefisk" -- 200,000
"Weebl and Bob" -- 169,000
"Wallace and Futuna" -- 531, but "Wallace et Futuna" -- 20,400
"Beaker folk" -- 777, but "Beaker People" -- 16,700
but, on the other hand,
"Bokak Atoll" -- 498, but "Taongi Atoll" -- 1,140
1715 "riot act" -- 943
1714 "riot act" -- 718
<a minor British celebrity of the 1970s> -- 714
"renifleurism" -- 275
"2-Hydroxyglutaricaciduria" -- 66
Now, this ranking procedure is not perfect: the Wallace and Futuna
islands clearly shouldn't be left out of any encyclopedia, and porn
stars will be wildly over-ranked due to search-spamming -- but at least
it gives a start to establishing the fame or notoriety of any given
subject. Given the apparent Zipf distribution, perhaps the logarithm of
the Google page count would be an appropriate measure: "America" would
score 9.1, "Papua New Guinea" 7.8, "Wallace and/et Futuna" 2.7 or 4.3
depending on the spelling, and "2-Hydroxyglutaricaciduria" 1.8, using
logs to base 10.
Other measures might be to look only at .gov/.gov.uk, or .edu/.ac.uk
etc. sites, to gain some idea of relative governmental or academic
interest in these subjects, perhaps as a measure of seriousness
(interestingly, Toad the Wet Sprocket still get 14 hits in .gov sites).
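The log-scaled score could be applied equally to whole-web counts or to counts restricted to .gov/.edu sites. A minimal sketch follows, using a few of the counts listed earlier. The function names are hypothetical, and two choices here are my own assumptions rather than anything settled: a title with zero hits scores 0, and a subject with several spellings takes the score of its best-known spelling.

```python
import math


def log_score(hit_count):
    """Base-10 log of a page count; zero hits score 0.0 (an assumption)."""
    return math.log10(hit_count) if hit_count > 0 else 0.0


def best_score(counts_by_spelling):
    """Score a subject by its most common spelling (an assumption)."""
    return max(log_score(c) for c in counts_by_spelling.values())


# A few of the page counts quoted earlier in this message.
hits = {
    "America": 1_260_000_000,
    "Papua New Guinea": 68_400_000,
    "2-Hydroxyglutaricaciduria": 66,
}
variants = {"Wallace and Futuna": 531, "Wallace et Futuna": 20_400}

for title, count in hits.items():
    print(f"{title}: {log_score(count):.1f}")
print(f"Wallace and/et Futuna: {best_score(variants):.1f}")
```

Taking the maximum over spellings is only one option; summing the counts before taking the log would give nearly the same answer whenever one spelling dominates.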
Still, it would be an interesting exercise to look up all current
articles and their redirects. Does anyone have a Google API account
charged with approximately 1.2 million searches? At one search a second,
we could have the figures ready in about two weeks.
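To make the two-week estimate concrete, and to show what a throttled lookup loop might look like, here is a sketch. It does not call any real search API: fetch_count is a hypothetical callable that whoever holds the API account would have to supply, and the one-second delay is simply the rate assumed above.

```python
import time


def estimated_days(n_queries, seconds_per_query=1.0):
    """Days needed to run n_queries back to back at a fixed rate."""
    return n_queries * seconds_per_query / 86_400


def collect_counts(titles, fetch_count, delay=1.0, sleep=time.sleep):
    """Look up each title as a quoted phrase, pausing between queries.

    fetch_count is a hypothetical function taking a query string and
    returning a page count; no real search API is assumed here.
    """
    counts = {}
    for i, title in enumerate(titles):
        if i:
            sleep(delay)  # throttle to roughly one query per `delay` seconds
        counts[title] = fetch_count(f'"{title}"')
    return counts


print(f"{estimated_days(1_200_000):.1f} days")
```

In practice the run would also need to save partial results and retry failed queries, since a two-week job is unlikely to finish without interruption.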
-- Neil