[WikiEN-l] Hypothetical print version, and using Google to rank articles

Neil Harris neil at tonal.clara.co.uk
Wed Oct 5 00:58:17 UTC 2005


Phroziac wrote:

> No way would we fit in the 30 volumes of Britannica for this
> hypothetical print release! Anyway, what if we had a feature in the
> Wikipedia 1.0 idea, where we could rate how useful the inclusion of an
> article in a print version would be. This would allow anyone making a
> print version, be it the foundation, or someone else, to trim wikipedia
> easier. Certainly you could do it by hand, but eek. that's huge. With
> our current database dumps, it would already not be unreasonable to make
> a script to automatically remove articles with stub tags in them.
> Obviously these would be worthless in a print version.
>
> What do you all think?
>
Just for comparison, the current edition of the EB has about 44M words 
in 32 volumes. As of July 13, the English-language Wikipedia contained 
649,000 articles, and a total of roughly 224M words.

Wikipedia currently has over 750,000 articles, so assuming that average
article size has not shrunk since July, it probably has around 258M
words. That is almost six times the size of the EB, and would fill at
least 187 volumes of EB-equivalent size.
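
The arithmetic behind those estimates can be checked with a few lines of Python. This is only a sketch: the variable names are mine, the inputs are the figures quoted above, and "over 750,000" is treated as exactly 750,000, so the results are lower bounds.

```python
# Inputs quoted in the message above.
EB_WORDS = 44_000_000         # current edition of the EB
EB_VOLUMES = 32
WP_ARTICLES_JULY = 649_000    # English-language Wikipedia, July 13
WP_WORDS_JULY = 224_000_000
WP_ARTICLES_NOW = 750_000     # "over 750,000", taken as a lower bound

# Assumption: average article length is unchanged since July.
words_per_article = WP_WORDS_JULY / WP_ARTICLES_JULY
wp_words_now = words_per_article * WP_ARTICLES_NOW

size_ratio = wp_words_now / EB_WORDS
volumes_needed = size_ratio * EB_VOLUMES

print(f"words per article:   {words_per_article:.0f}")
print(f"estimated words now: {wp_words_now / 1e6:.1f}M")
print(f"ratio to EB:         {size_ratio:.2f}")
print(f"EB-sized volumes:    {volumes_needed:.1f}")
```

Rounding the word count down to 258M before dividing gives the 187-volume figure; without that rounding the result is slightly higher.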

In my opinion, an article ranking system would be an ideal way to start
collecting the data needed to place articles in rank order for inclusion
in a fixed amount of space.
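
As a rough illustration of what "rank order for inclusion in a fixed amount of space" could mean in practice, here is a minimal greedy sketch. Everything in it is hypothetical: the `Article` type, the `select_for_print` function, the `score` field, and the sample values are invented for the example and do not correspond to any existing Wikipedia tool or data.

```python
from typing import NamedTuple

class Article(NamedTuple):
    title: str
    words: int
    score: float   # hypothetical rank score, higher means more worth printing

def select_for_print(articles: list[Article], word_budget: int) -> list[Article]:
    """Take articles in descending score order, skipping any that no longer fit."""
    chosen = []
    remaining = word_budget
    for art in sorted(articles, key=lambda a: a.score, reverse=True):
        if art.words <= remaining:
            chosen.append(art)
            remaining -= art.words
    return chosen

# Placeholder values, invented purely to exercise the function.
sample = [
    Article("A", 5000, 9.0),
    Article("B", 3000, 7.5),
    Article("C", 4000, 6.0),
    Article("D", 500, 2.0),
]
picked = select_for_print(sample, word_budget=9000)
print([a.title for a in picked])
```

A greedy pass like this is not guaranteed to use the space optimally, but it keeps the highest-ranked articles, which seems closer to the intent than packing the volumes as tightly as possible.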

One interesting possibility is, in addition to user rankings, using the 
number of times the article's title is mentioned on the web -- the 
Google test -- as an extra input to any hypothetical ranking system.

For example, using this very crude test:

"America" -- 1,260,000,000
"Papua New Guinea" -- 68,400,000
"gallbladder" -- 2,670,000
"Basement Jaxx" -- 2,320,000
"Hilbert space" -- 1,770,000
"catecholamine" -- 1,200,000
"Xenu" -- 595,000
"Horatio Nelson" -- 403,000
"Toad the Wet Sprocket" -- 354,000 [!]
"lutefisk" -- 200,000
"Weebl and Bob" -- 169,000
"Wallace and Futuna" -- 531, but "Wallace et Futuna" -- 20,400
"Beaker folk" -- 777, but "Beaker People" -- 16,700

but, on the other hand,

"Bokak Atoll" -- 498, but "Taongi Atoll" -- 1,140
1715 "riot act" -- 943
1714 "riot act" -- 718
<a minor British celebrity of the 1970s> -- 714
"renifleurism" -- 275
"2-Hydroxyglutaricaciduria" -- 66

Now, this ranking procedure is not perfect: the Wallace and Futuna 
islands clearly shouldn't be left out of any encyclopedia, and porn 
stars will be wildly over-ranked due to search-spamming -- but at least 
it gives a start to establishing the fame or notoriety of any given 
subject. Given the apparent Zipf distribution, perhaps the logarithm of
the Google page count would be an appropriate measure: "America" would
score 9.1, "Papua New Guinea" 7.8, "Wallace and Futuna" 2.7 (or 4.3 as
"Wallace et Futuna"), and "2-Hydroxyglutaricaciduria" 1.8, using logs to
base 10.
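
For anyone who wants to reproduce the log scores, a short sketch follows. The hit counts are copied from the lists above; the `log_score` helper is my own name, and scoring a zero-hit title as 0 is an assumption made only to avoid taking the log of zero.

```python
import math

# Hit counts copied from the lists above.
hits = {
    "America": 1_260_000_000,
    "Papua New Guinea": 68_400_000,
    "Wallace et Futuna": 20_400,
    "Wallace and Futuna": 531,
    "2-Hydroxyglutaricaciduria": 66,
}

def log_score(count: int) -> float:
    """Base-10 log of the hit count; zero hits score 0.0 (an assumption)."""
    return math.log10(count) if count > 0 else 0.0

# Print titles from most to fewest hits with their one-decimal score.
for title, count in sorted(hits.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{title:28s} {log_score(count):.1f}")
```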

Other measures might be to look only at .gov/.gov.uk, or .edu/.ac.uk 
etc. sites, to gain some idea of relative governmental or academic 
interest in these subjects, perhaps as a measure of seriousness 
(interestingly, Toad the Wet Sprocket still get 14 hits in .gov sites).

Still, it would be an interesting exercise to look up all current 
articles and their redirects. Does anyone have a Google API account 
charged with approximately 1.2 million searches? At one search a second, 
we could have the figures ready in about two weeks.
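
The two-week figure is just the search count divided by the rate. A quick check, assuming a steady one search per second with no retries, quota limits, or downtime (the constant names are mine):

```python
from datetime import timedelta

SEARCHES = 1_200_000       # articles plus redirects, figure from above
SEARCHES_PER_SECOND = 1    # assumed steady rate, no retries or downtime

duration = timedelta(seconds=SEARCHES / SEARCHES_PER_SECOND)
print(f"{duration.days} days, {duration.seconds // 3600} hours")
```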

-- Neil