..even though it knows about them. Daniel Brandt explains the problem
here:
http://www.google-watch.org/dying.html
Of course that article is hyperbole, but you can see the problem if you
search for articles on Wikipedia that do not include the word "MediaWiki"
(which occurs on every page), e.g.:
http://www.google.com/search?q=site%3Aen.wikipedia.org+-mediawiki
This lists the pages on en.wikipedia.org that are not indexed, presently
about 255,000 (Google apparently has a list of URLs with no associated
content).
That includes dynamic URLs, redirects and other duplicates, but it also
includes many pages that should be indexed. E.g. if I search for the text
of the English article "Potassium", I currently get plenty of mirrors, but
not Wikipedia itself.
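For anyone who wants to reproduce or vary the query above, here is a
minimal sketch of how that URL is built; the operators are Google's
"site:" restriction and the "-word" exclusion, and the percent-encoding
is what turns ":" into "%3A" and the space into "+". The function name
is just for illustration.

```python
from urllib.parse import urlencode

# Build the Google query used above: restrict results to en.wikipedia.org
# and exclude pages containing the word "mediawiki". Since "MediaWiki"
# appears on every indexed Wikipedia page, what remains are URLs Google
# knows about but has no indexed content for.
def google_query_url(site, exclude_word):
    query = "site:%s -%s" % (site, exclude_word)
    # urlencode percent-encodes the query: ":" -> "%3A", " " -> "+"
    return "http://www.google.com/search?" + urlencode({"q": query})

print(google_query_url("en.wikipedia.org", "mediawiki"))
# -> http://www.google.com/search?q=site%3Aen.wikipedia.org+-mediawiki
```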
I have a few questions, which someone with more Google-Fu than I may be
able to answer:
1) Why exactly doesn't Google index these articles?
2) Is there a way we can get Google to index them?
3) Is there a clever way to get a list of only the actual articles that
aren't indexed, as opposed to the redirects, edit pages and dupes? That
would help in estimating the scope of the problem.
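On question 3, one crude starting point, if we had the raw list of URLs,
would be to filter out edit/history views and non-article namespaces by
URL pattern. This is only a hypothetical heuristic (the namespace list
and URL shapes below are assumptions, not a complete inventory):

```python
import re

# Hypothetical filter: drop edit/history views and common non-article
# namespaces (Talk:, User:, Wikipedia:, etc.), keeping plain article URLs.
NON_ARTICLE = re.compile(
    r"action=(edit|history)"                                  # edit/history views
    r"|/wiki/(Talk|User|Wikipedia|Special|Image|Template)(_talk)?:"
)

def article_urls(urls):
    return [u for u in urls if not NON_ARTICLE.search(u)]

urls = [
    "http://en.wikipedia.org/wiki/Potassium",
    "http://en.wikipedia.org/w/wiki.phtml?title=Potassium&action=edit",
    "http://en.wikipedia.org/wiki/Talk:Potassium",
]
print(article_urls(urls))  # only the first URL survives
```

This would not catch redirects (those look like ordinary article URLs),
so it only narrows the list rather than solving the problem.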
This also makes it clear to me that we can't rely on Google to do our
searching for us. Its index is not just outdated, it is also quite incomplete. We
really need to develop a working search infrastructure. In the meantime,
we might want to add some of our "good mirrors" to the Google search
query.
Regards,
Erik