Dear Sirs,
Wikipedia is a large, collaborative project to produce a free encyclopedia. There are currently nearly 100,000 articles, and about 200,000 page impressions per day.
As of last weekend, all Wikipedia articles seem to have disappeared from the Google index. Wikipedia article URLs look like this:
http://www.wikipedia.org/wiki/<article-name>
We are not aware of an outage that might have caused the Google spider to miss the pages. I would much appreciate it if you could shed some light on the issue.
Sincerely,
Erik Moeller
On Tue, 2003-01-07 at 05:25, Erik Moeller wrote:
As of last weekend, all Wikipedia articles seem to have disappeared from the Google index. Wikipedia article URLs look like this:
http://www.wikipedia.org/wiki/<article-name>
We are not aware of an outage that might have caused the Google spider to miss the pages. I would much appreciate it if you could shed some light on the issue.
Upon noticing that the main pages and mailing list archives _are_ indexed, I have my suspicions about our robots.txt file; the line:
Disallow: /w
perhaps should be:
Disallow: /w/
The former may be accidentally blocking /wiki/<article-name> paths -- which of course form the bulk of our content! -- in addition to the scripted pages under the /w subdirectory that it's intended to block.
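As a quick cross-check, here is a minimal sketch in Python (using the modern standard library's urllib.robotparser, which applies the same prefix-matching rule) of how the two directives differ; /w/wiki.phtml is just an illustrative script path:

    # Compare how "Disallow: /w" and "Disallow: /w/" treat an
    # article path versus a script path under /w.
    from urllib.robotparser import RobotFileParser

    for rule in ("Disallow: /w", "Disallow: /w/"):
        parser = RobotFileParser()
        parser.parse(["User-agent: *", rule])
        for path in ("/wiki/Main_Page", "/w/wiki.phtml"):
            verdict = "allowed" if parser.can_fetch("*", path) else "blocked"
            print(f"{rule}: {path} -> {verdict}")

With "Disallow: /w" both paths come back blocked; with "Disallow: /w/" only the /w/ script path does.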
I have updated the robots.txt file; if indeed this is how the googlebot was interpreting the line, I hope we can be respidered soon...
-- brion vibber (brion@pobox.com / brion@wikipedia.org)
Brion Vibber wrote:
Upon noticing that the main pages and mailing list archives _are_ indexed, I have my suspicions about our robots.txt file; the line:
Disallow: /w
perhaps should be:
Disallow: /w/
Alas, I think you're right. The robots exclusion standard suggests that implementations use a simple substring comparison, and all of its examples of directory exclusion include the trailing slash.
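To make that rule concrete, a hypothetical two-line checker in Python:

    def is_blocked(path, disallow):
        # Robots exclusion standard: a path is excluded if it simply
        # begins with the Disallow value; there is no directory awareness.
        return path.startswith(disallow)

    is_blocked("/wiki/Main_Page", "/w")   # True  -- articles caught by "Disallow: /w"
    is_blocked("/wiki/Main_Page", "/w/")  # False -- articles escape "Disallow: /w/"
    is_blocked("/w/wiki.phtml", "/w/")    # True  -- scripts still excluded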
-- Neil