[Wikipedia-l] Look Who's Using Wikipedia

Ray Saintonge saintonge at telus.net
Fri Mar 16 18:11:14 UTC 2007


Lars Aronsson wrote:

>Delirium wrote:
>>So I'd be surprised if we're "done" covering even the top-tier subjects 
>>before we get to 3 million articles, if even then.
>
>Corpus linguists first collect a very large body of text (a 
>corpus), then count how many times each word occurs.  If every 
>20th word (or 5% of the corpus) is "the", then a dictionary 
>containing only "the" will "cover" 5% of this corpus.  Most 
>spelling dictionaries can cover 95-98% of any normal corpus.  But 
>creating a dictionary that covers the last few percent is hard, 
>because any normal text will contain a few very uncommon words.  
>There is a very long, thin tail.
>
Good coverage is easier to achieve for dictionaries than in many other 
areas of knowledge.  For major languages, excellent resources already 
exist.
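
To make that concrete, here is a rough Python sketch of the coverage 
computation Lars describes.  The sample text and the one-word 
dictionary are invented for illustration, not real data:

    import re
    from collections import Counter

    def coverage(corpus_text, dictionary_words):
        """Fraction of corpus tokens that appear in the dictionary."""
        tokens = re.findall(r"[a-z']+", corpus_text.lower())
        counts = Counter(tokens)
        covered = sum(n for word, n in counts.items()
                      if word in dictionary_words)
        return covered / sum(counts.values())

    # Hypothetical example: a dictionary containing only "the" covers
    # whatever share of the tokens "the" accounts for.
    sample = "The cat sat on the mat and the dog chased the cat."
    print(coverage(sample, {"the"}))  # 4 of 12 tokens -> about 0.33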

>Could we compile a "corpus" of questions, and see how large a 
>percentage of them can be answered by Wikipedia?  That probably 
>requires artificial intelligence, if not science fiction.  (Hey, 
>did somebody write a novel about this already?  Sci-fi can be a 
>great source of inspiration.)  I guess we could compile a list of 
>famous places and people, and see what percentage of them have 
>articles of reasonable length.  But having an entry on St 
>Petersburg, Florida, is less important than having entries on 
>London or Paris.  So the list must be weighted.  Search companies 
>like Google or MSN keep logs of every query that people type in, 
>so they know exactly how many times more often people search for 
>London than for St Petersburg, Florida.  Those are the kinds of 
>weights we would need to compute coverage, I guess.
>
An interesting source here would be published quiz and trivia books.  
While one must weigh the reliability of their information carefully, 
they can still be used as a way of testing coverage.
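
Given such a weighted list of topics, whether from query logs or from 
quiz books, the computation itself is simple.  A minimal sketch, where 
the weights and the coverage test are invented for illustration:

    def weighted_coverage(topic_weights, has_decent_article):
        """topic_weights maps topic -> weight (e.g. query frequency);
        has_decent_article(topic) says whether Wikipedia covers the
        topic, however one chooses to define "covers"."""
        covered = sum(w for topic, w in topic_weights.items()
                      if has_decent_article(topic))
        return covered / sum(topic_weights.values())

    # Hypothetical weights: London is searched for far more often than
    # St Petersburg, Florida, so missing it would hurt the score more.
    weights = {"London": 1000, "Paris": 800, "St Petersburg, Florida": 5}
    print(weighted_coverage(weights, lambda t: t in {"London", "Paris"}))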

Ec



