On Thu, Apr 10, 2003 at 08:46:21PM -0700, Brion Vibber wrote:
On Thu, 2003-04-10 at 19:58, Nick Reinking wrote:
I think the misunderstanding is not on Google's part. As far as I can tell, Google isn't indexing that page.
A quick search on Google for "wibrator wikipedia" shows a sub-result for the edit link. Note that it doesn't have any 'Cached' link. This means that Google saw a link to the edit page on a page that it was allowed to index.
I didn't say it was being cached, that its content could be word-searched, or that it had been spidered through to other pages. I said it was *indexed*. Now, maybe Google uses some word other than "indexed" to mean "contained in a database of links which are shown to users when they search for words contained in the link". I'll buy that. Maybe the word they use is "florble". In that case, the page is being florbled despite our best efforts to stop it from being florbled.
Is there any way we can tell Google not to florble pages that are explicitly excluded by our robots.txt file so that people will stop complaining to *us* about Google's overzealous florbling?
Hypothetically we could jimmy the page to not produce edit links if the user agent is Googlebot, but that would be very annoying for several reasons:
- The Google-cached page would be missing those links.
- This would screw with page caching. Google hits a lot of pages, and we'd have to either not cache any of its hits or be very careful in coding around it.
-- brion vibber (brion @ pobox.com)
I've always understood 'indexed' to mean 'downloaded the entire page and added its contents to a searchable database.' As far as I know, robots.txt just tells Google (and everybody else) not to download the page; it doesn't say they can't link to it. Since the Masturbacja page says to follow links, but robots.txt says not to index edit links, Google does the sensible thing: it creates the link in its database, but doesn't index the content. Go figure; the Google engineers would probably cooperate with you if you asked them nicely. :)
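
For what it's worth, here is a small Python sketch of that distinction, using the standard library's urllib.robotparser. The robots.txt rules and the example.org URLs are made up for illustration and are not Wikipedia's actual file. All the parser can answer is "may this user agent fetch this URL?"; nothing in the format says whether a crawler may record a link it found on some other page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: assume edit pages live under /w/ and articles under /wiki/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs for an edit page and the article it belongs to.
edit_url = "http://example.org/w/index.php?title=Example&action=edit"
article_url = "http://example.org/wiki/Example"

# robots.txt only governs fetching: the edit page is off limits,
# the article (which contains a link to the edit page) is not.
edit_allowed = parser.can_fetch("Googlebot", edit_url)
article_allowed = parser.can_fetch("Googlebot", article_url)
print(edit_allowed, article_allowed)
```

So a crawler that obeys these rules never downloads the edit page, but it still sees the edit URL while reading the article, and what it does with that URL is up to the crawler.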