[Wikipedia-l] Look Who's Using Wikipedia
Ray Saintonge
saintonge at telus.net
Fri Mar 16 18:11:14 UTC 2007
Lars Aronsson wrote:
>Delirium wrote:
>
>
>>So I'd be surprised if we're "done" covering even the top-tier subjects
>>before we get to 3 million articles, if even then.
>>
>>
>Corpus linguists first collect a very large body (a corpus) of
>text, then they count how many times each word occurs. If every
>20th word (or 5% of the corpus) is "the", then a dictionary only
>containing "the" will "cover" 5% of this corpus. Most spelling
>dictionaries can cover 95-98% of any normal corpus. But creating
>a dictionary that covers the last few percent is hard, because
>any normal text will contain a few very uncommon words. There is
>a very long, thin tail.
>
Good coverage is easier for dictionaries than for many other areas of
knowledge. For major languages, excellent resources already exist.
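The coverage calculation Lars describes can be made concrete with a short
sketch. This is a toy example (an invented nine-word corpus and a made-up
dictionary, not real data): count how often each token occurs, then report
the fraction of tokens the dictionary accounts for.

```python
from collections import Counter

def coverage(corpus_tokens, dictionary):
    """Fraction of corpus tokens that appear in the dictionary."""
    counts = Counter(corpus_tokens)
    covered = sum(n for word, n in counts.items() if word in dictionary)
    return covered / sum(counts.values())

# Toy corpus for illustration only
tokens = "the cat sat on the mat near the quokka".split()

print(coverage(tokens, {"the"}))  # "the" alone covers 3 of 9 tokens
print(coverage(tokens, {"the", "cat", "sat", "on", "mat", "near"}))
```

Adding common words raises coverage quickly, but the rare tail ("quokka"
here) is what keeps any dictionary short of 100%.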
>Could we compile a "corpus" of questions, and see how large a
>percentage of them can be answered by Wikipedia? That probably
>requires artificial intelligence, if not science fiction. (Hey,
>did somebody write a novel about this already? Sci-fi can be a
>great source of inspiration.) I guess we could compile a list of
>famous places and people, and see what percentage of them have
>articles of reasonable length. But having an entry on St
>Petersburg, Florida, is less important than having entries on
>London or Paris. So the list must be weighted. Search companies
>like Google or MSN keep logs of every query that people type in,
>so they know exactly how many times more often people search for
>London than for St Petersburg, Florida. That's the kind of
>weights we would need to compute a coverage, I guess.
>
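The weighted-coverage idea above could be sketched roughly like this,
assuming one had query frequencies of the kind search companies log. The
counts below are invented purely for illustration:

```python
def weighted_coverage(query_weights, covered_topics):
    """Fraction of queries answerable, weighting each topic by how
    often people search for it."""
    total = sum(query_weights.values())
    hit = sum(w for topic, w in query_weights.items() if topic in covered_topics)
    return hit / total

# Hypothetical query counts -- not real search-log data
weights = {"London": 900, "Paris": 800, "St Petersburg, Florida": 30}

print(weighted_coverage(weights, {"London", "Paris"}))
```

Under this weighting, covering London and Paris but missing St Petersburg,
Florida, still yields coverage close to 1, which matches the intuition that
the popular entries matter most.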
An interesting source here would be published quiz and trivia books.
While one must weigh the reliability of their information carefully,
they can still be used as a way of testing coverage.
Ec