Lars Aronsson wrote:
Delirium wrote:
So I'd be surprised if we're
"done" covering even the top-tier subjects
before we get to 3 million articles, if even then.
Corpus linguists first collect a very large body (a corpus) of
text, then they count how many times each word occurs. If every
20th word (or 5% of the corpus) is "the", then a dictionary
containing only "the" will "cover" 5% of this corpus. Most spelling
dictionaries can cover 95-98% of any normal corpus. But creating
a dictionary that covers the last few percent is hard, because
any normal text will contain a few very uncommon words. There is
a very long, thin tail.
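That coverage calculation is easy to sketch. Here is a minimal
illustration (the corpus and dictionary are made up, not from any
real resource):

```python
from collections import Counter

# Toy corpus and toy dictionary, purely for illustration.
corpus = "the cat sat on the mat and the dog ate the bone".split()
dictionary = {"the", "cat", "dog", "on", "and"}

counts = Counter(corpus)
total_tokens = sum(counts.values())

# Coverage: the share of corpus tokens found in the dictionary.
covered = sum(n for word, n in counts.items() if word in dictionary)
coverage = covered / total_tokens
print(f"coverage: {coverage:.0%}")  # 8 of 12 tokens, i.e. 67%
```

Note that "the" alone accounts for a third of this toy corpus,
which is the point about frequent words doing most of the work.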
Good coverage is easier for dictionaries than for many other areas of
knowledge. For major languages, excellent resources already exist.
Could we compile a "corpus" of questions, and see how large a
percentage of them can be answered by Wikipedia? That probably
requires artificial intelligence, if not science fiction. (Hey,
did somebody write a novel about this already? Sci-fi can be a
great source of inspiration.) I guess we could compile a list of
famous places and people, and see what percentage of them have
articles of reasonable length. But having an entry on St
Petersburg, Florida, is less important than having entries on
London or Paris. So the list must be weighted. Search companies
like Google or MSN keep logs of every query that people type in,
so they know exactly how many times more often people search for
London than for St Petersburg, Florida. Those are the kinds of
weights we would need to compute coverage, I guess.
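The weighted-coverage idea can be sketched in a few lines. The search
counts below are invented for illustration; real numbers would have to
come from a search engine's query logs:

```python
# Hypothetical query counts per topic (made-up figures).
search_counts = {
    "London": 1_000_000,
    "Paris": 800_000,
    "St Petersburg, Florida": 20_000,
}

# Topics that (in this toy example) have an article of reasonable length.
has_article = {"London", "Paris"}

total_weight = sum(search_counts.values())
covered_weight = sum(
    weight for topic, weight in search_counts.items() if topic in has_article
)
weighted_coverage = covered_weight / total_weight
print(f"weighted coverage: {weighted_coverage:.1%}")
```

In this toy example, missing the St Petersburg article costs only about
1% of weighted coverage, even though it is a third of the topic list.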
An interesting source here would be published quiz and trivia books.
While one must weigh the reliability of their information carefully,
they can still be used as a way of testing coverage.
Ec