Delirium wrote:
> So I'd be surprised if we're "done"
> covering even the top-tier subjects
> before we get to 3 million articles, if even then.
After "size" (number of "real" articles), the next measurement
must be "coverage". But how do we define that? Surely, coverage
was 0% on the first day and will be 100% if Wikipedia can answer
any possible question, but this is an unrealistic limit. Maybe
Britannica is at 90% and Wikipedia at 80%? But we don't know how
to define the "coverage" of an encyclopedia to begin with.
Corpus linguists first collect a very large body (a corpus) of
text, then they count how many times each word occurs. If every
20th word (or 5% of the corpus) is "the", then a dictionary only
containing "the" will "cover" 5% of this corpus. Most spelling
dictionaries can cover 95-98% of any normal corpus. But creating
a dictionary that covers the last few percent is hard, because
any normal text will contain a few very uncommon words. There is
a very long, thin tail.
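
That kind of coverage is at least straightforward to compute once
you have the corpus and the word list. A rough sketch in Python
(the file names are made up; I assume whitespace-separated tokens
in corpus.txt and one headword per line in spelling.dic):

    from collections import Counter

    def dictionary_coverage(tokens, dictionary):
        # Fraction of corpus tokens that appear in the dictionary.
        counts = Counter(tokens)
        covered = sum(n for word, n in counts.items()
                      if word in dictionary)
        return covered / float(sum(counts.values()))

    # Made-up file names, see above.
    tokens = open("corpus.txt").read().lower().split()
    dictionary = set(open("spelling.dic").read().lower().split())
    print("coverage: %.1f%%"
          % (100 * dictionary_coverage(tokens, dictionary)))

Run that against a real word list and you can watch the long tail
for yourself: each extra percent of coverage takes many more
headwords than the previous one did.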
Could we compile a "corpus" of questions, and see how large a
percentage of them can be answered by Wikipedia? That probably
requires artificial intelligence, if not science fiction. (Hey,
did somebody write a novel about this already? Sci-fi can be a
great source of inspiration.) I guess we could compile a list of
famous places and people, and see what percentage of them have
articles of reasonable length. But having an entry on St
Petersburg, Florida, is less important than having entries on
London or Paris. So the list must be weighted. Search companies
like Google or MSN keep logs of every query that people type in,
so they know exactly how many times more often people search for
London than for St Petersburg, Florida. Those are the kind of
weights we would need to compute coverage, I guess.
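
To make that concrete: weighted coverage would simply be the share
of all searches whose topic has a reasonable article. A sketch in
Python, where the query counts and the article judgements are
entirely invented:

    # Invented query-log counts (searches per month) and equally
    # invented judgements of whether a reasonable article exists.
    query_counts = {"London": 500000,
                    "Paris": 450000,
                    "St Petersburg, Florida": 2000}
    has_article = {"London": True,
                   "Paris": True,
                   "St Petersburg, Florida": False}

    total = sum(query_counts.values())
    covered = sum(n for topic, n in query_counts.items()
                  if has_article[topic])
    print("weighted coverage: %.1f%%" % (100.0 * covered / total))

With these made-up numbers, missing St Petersburg, Florida, costs
almost nothing, while missing London would cost more than half the
score, which is exactly the weighting we want.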
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se