Phroziac wrote:
No way would we fit in the 30 volumes of Britannica for this hypothetical print release! Anyway, what if we had a feature in the Wikipedia 1.0 idea where we could rate how useful the inclusion of an article in a print version would be? This would allow anyone making a print version, be it the foundation or someone else, to trim Wikipedia more easily. Certainly you could do it by hand, but eek, that's huge. With our current database dumps, it would already not be unreasonable to make a script to automatically remove articles with stub tags in them. Obviously these would be worthless in a print version.
What do you all think?
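The stub-stripping script suggested above could look roughly like the sketch below. It assumes stub tags appear in the wikitext as templates whose name ends in "stub", such as {{stub}} or {{geo-stub}}; the names is_stub and drop_stubs, and the idea of holding articles in a title-to-text dict, are made up for illustration and are not taken from any existing dump tool.

```python
import re

# Assumption: a stub tag is a template whose name ends in "stub",
# e.g. {{stub}} or {{geo-stub}}. Real dumps may use other conventions.
STUB_TEMPLATE = re.compile(r"\{\{[^{}|]*stub\s*(\|[^{}]*)?\}\}", re.IGNORECASE)


def is_stub(wikitext):
    """Return True if the article text carries a stub template."""
    return STUB_TEMPLATE.search(wikitext) is not None


def drop_stubs(articles):
    """Given a {title: wikitext} dict, return a copy without stub articles."""
    return {title: text for title, text in articles.items() if not is_stub(text)}


if __name__ == "__main__":
    sample = {
        "Hilbert space": "A Hilbert space is a complete inner product space.",
        "Bokak Atoll": "Bokak Atoll is an atoll. {{geo-stub}}",
    }
    print(sorted(drop_stubs(sample)))
```

Reading the articles out of an actual database dump is left out here, since that depends on the dump format in use.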
Just for comparison, the current edition of the EB has about 44M words in 32 volumes. As of July 13, the English-language Wikipedia contained 649,000 articles, and a total of roughly 224M words.
Wikipedia currently has over 750,000 articles, so assuming that the average article size has not shrunk, it probably has around 258M words. This is almost six times the size of the EB, and would take at least 187 volumes of EB-equivalent size.
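For anyone who wants to check the arithmetic, here is a small sketch of the estimate. The input figures are the ones quoted above; the variable names are just labels chosen for this example, and the calculation assumes that words per article and words per EB volume stay constant.

```python
# Figures quoted in this post
EB_WORDS = 44_000_000          # current EB edition
EB_VOLUMES = 32
WP_WORDS_JULY = 224_000_000    # English Wikipedia as of July 13
WP_ARTICLES_JULY = 649_000
WP_ARTICLES_NOW = 750_000

# Assumption: average article length is unchanged since July.
words_per_article = WP_WORDS_JULY / WP_ARTICLES_JULY
wp_words_now = words_per_article * WP_ARTICLES_NOW

# Assumption: a printed volume holds as many words as an average EB volume.
size_ratio = wp_words_now / EB_WORDS
volumes_needed = size_ratio * EB_VOLUMES

print(f"estimated words:  {wp_words_now / 1e6:.0f}M")
print(f"ratio to EB:      {size_ratio:.2f}")
print(f"EB-sized volumes: {volumes_needed:.0f}")
```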
In my opinion, an article ranking system would be an ideal way to start collecting data for trying to place articles in rank order for inclusion in a fixed amount of space.
One interesting possibility is, in addition to user rankings, using the number of times the article's title is mentioned on the web -- the Google test -- as an extra input to any hypothetical ranking system.
For example, using this very crude test:
"America" -- 1,260,000,000
"Papua New Guinea" -- 68,400,000
"gallbladder" -- 2,670,000
"Basement Jaxx" -- 2,320,000
"Hilbert space" -- 1,770,000
"catecholamine" -- 1,200,000
"Xenu" -- 595,000
"Horatio Nelson" -- 403,000
"Toad the Wet Sprocket" -- 354,000 [!]
"lutefisk" -- 200,000
"Weebl and Bob" -- 169,000
"Wallace and Futuna" -- 531, but "Wallace et Futuna" -- 20,400
"Beaker folk" -- 777, but "Beaker People" -- 16,700
but, on the other hand,
"Bokak Atoll" -- 498, but "Taongi Atoll" -- 1,140
1715 "riot act" -- 943
1714 "riot act" -- 718
<a minor British celebrity of the 1970s> -- 714
"renifleurism" -- 275
"2-Hydroxyglutaricaciduria" -- 66
Now, this ranking procedure is not perfect: the Wallace and Futuna islands clearly shouldn't be left out of any encyclopedia, and porn stars will be wildly over-ranked due to search-spamming -- but at least it gives a start to establishing the fame or notoriety of any given subject. Given the apparent Zipf distribution, perhaps the logarithm of the Google page count would be an appropriate measure: using logs to base 10, "America" would score 9.1, "Papua New Guinea" 7.8, "Wallace and/et Futuna" 4.3, and "2-Hydroxyglutaricaciduria" 1.8.
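A minimal sketch of that log-based score, using a few of the hit counts listed above. The function name fame_score is invented for this example, and treating a zero-hit title as scoring 0 is my own assumption about how to handle that edge case.

```python
import math


def fame_score(hits):
    """Base-10 log of the page count; zero hits is scored as 0 (assumption)."""
    return math.log10(hits) if hits > 0 else 0.0


# Hit counts copied from the list earlier in this post
hit_counts = {
    "America": 1_260_000_000,
    "Papua New Guinea": 68_400_000,
    "Hilbert space": 1_770_000,
    "lutefisk": 200_000,
    "2-Hydroxyglutaricaciduria": 66,
}

# Rank titles from most to least "famous" by score
ranked = sorted(hit_counts, key=lambda title: fame_score(hit_counts[title]), reverse=True)

for title in ranked:
    print(f"{fame_score(hit_counts[title]):4.1f}  {title}")
```

The ordering is the same as sorting by the raw counts; the point of the logarithm is only to squash the range into a scale that could be combined with user ratings.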
Other measures might be to look only at .gov/.gov.uk, or .edu/.ac.uk etc. sites, to gain some idea of relative governmental or academic interest in these subjects, perhaps as a measure of seriousness (interestingly, Toad the Wet Sprocket still get 14 hits in .gov sites).
Still, it would be an interesting exercise to look up all current articles and their redirects. Does anyone have a Google API account charged with approximately 1.2 million searches? At one search a second, we could have the figures ready in about two weeks.
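The two-week figure is just the search count divided by the rate. As a quick check, with the constant names chosen only for this example:

```python
SEARCHES = 1_200_000        # approximate number of articles plus redirects
SEARCHES_PER_SECOND = 1     # assumed sustained query rate
SECONDS_PER_DAY = 24 * 60 * 60

total_seconds = SEARCHES / SEARCHES_PER_SECOND
total_days = total_seconds / SECONDS_PER_DAY

print(f"{total_days:.1f} days of continuous querying")
```

This ignores retries, rate limits, and downtime, so the real figure would likely be somewhat longer.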
-- Neil