On 20 Mar 2003, Brion Vibber wrote:
That wouldn't have gone to an edit page, just to a blank page. The problem here is that an actual edit URL got into google at some point and is still coming up in results.
Sannse, we *do* exclude edit pages from google's and other bots' spiders, doubly:
robots.txt excludes access to the /w/ subdirectory, and thus all direct script actions (edits, histories, diffs, printable mode, changing options/length on recentchanges, etc.), so a well-behaved bot shouldn't be touching them at all.
edit pages and such have meta tags telling robots "noindex,nofollow"; i.e. if they do end up with such a page, they shouldn't index it and shouldn't follow links from it, but should just toss the page out and go back where they came from. (Both exclusions are sketched below.)
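For reference, the two mechanisms look roughly like this - a sketch assuming the standard robots.txt and meta-tag syntax, not a verbatim quote of the live wikipedia.org configuration:

    # robots.txt at the site root: keep spiders out of the script directory
    User-agent: *
    Disallow: /w/

    <!-- emitted in the <head> of edit pages and other script output -->
    <meta name="robots" content="noindex,nofollow">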
A few have somehow gotten through. I'm not sure how. They may be old and not yet flushed (googlebot is still going over the site and hasn't reindexed every page yet). Note that in the google results there's no summary extract, no cache, no notice of the size. It's just a raw URL sitting there in the results. That's weird and wrong, and to me indicates a problem in their index.
This is more than a few - do a search for "site:wikipedia.org action=edit". That gives an estimated 148,000 hits; only the first two seem to be NOT edit pages. It seems to me that google works according to one of the following two procedures:

* When a URL is forbidden by robots.txt, they do keep the URL in the database, but do not follow it. Thus, the page will be in the database, but only with its URL, without any title or content.
* When a new URL is found, it is added to the database in the manner described above. Only at a later stage is it found to be forbidden by robots.txt and thrown out again - to then be moved back in when a link to it is found.
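One way to sanity-check the premise behind both procedures - that the edit URLs really are forbidden, so they can only sit in the index as bare URLs - is to ask a robots.txt parser directly. A minimal sketch in Python, assuming the standard-library urllib.robotparser and a hypothetical edit URL under /w/ (the exact URL form is an assumption, not taken from the site):

    import urllib.robotparser

    # Load and parse the site's robots.txt (assumes it is served at the usual location).
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.wikipedia.org/robots.txt")
    rp.read()

    # An ordinary article URL versus a direct script action under /w/.
    article_url = "http://www.wikipedia.org/wiki/Main_Page"
    edit_url = "http://www.wikipedia.org/w/wiki.phtml?title=Main_Page&action=edit"

    print(rp.can_fetch("*", article_url))  # expected: True
    print(rp.can_fetch("*", edit_url))     # expected: False, if /w/ is disallowed

If the second check comes back False, then any such URL showing up in google's results can only be there in exactly the state described above: a bare, never-fetched entry with no title, extract, or cache.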
Andre Engels