Google doesn't index thousands of pages - Wikitech-l

13 Sep 2004

..even though it knows about them. Daniel Brandt explains the problem  
here:
http://www.google-watch.org/dying.html

Of course that article is hyperbole, but you can see the problem if you  
search for articles on Wikipedia that do not include the word "MediaWiki"  
(which occurs on every page), e.g.:

http://www.google.com/search?q=site%3Aen.wikipedia.org+-mediawiki

This lists the pages on en.wikipedia.org that are not indexed, presently  
255,000 (Google apparently has a list of URLs with no associated content).  
That includes dynamic URLs, redirects and other duplicates, but it also  
includes many pages that should be indexed. E.g. if I search for the text  
of the English article "Potassium", I currently get plenty of mirrors, but  
not Wikipedia itself.

I have a few questions, which someone with more Google-Fu than I may be  
able to answer:
1) Why exactly doesn't Google index these articles?
2) Is there a way we can make them index them?
3) Is there a clever way to get a list of only the actual articles that  
aren't indexed, as opposed to the redirects, edit pages and dupes? That  
would help in estimating the scope of the problem.

It also to me makes it clear that we can't just rely on Google to do our  
searching for us. It is not just outdated, it is also quite incomplete. We  
really need to develop a working search infrastructure. In the meantime,  
we might want to add some of our "good mirrors" to the Google search  
query.

Regards,

Erik