[Wikipedia-l] Okhrana and Google

Andre Engels engels at uni-koblenz.de
Fri Mar 21 12:01:50 UTC 2003


On 20 Mar 2003, Brion Vibber wrote:

> That wouldn't have gone to an edit page, just to a blank page. The
> problem here is that an actual edit URL got into google at some point
> and is still coming up in results.
>
> Sannse, we *do* exclude edit pages from Google's and other bots'
> spiders, in two ways:
>
> * robots.txt excludes access to the /w/ subdirectory, and thus all
>   direct script actions (edits, histories, diffs, printable mode,
>   changing options/length on recentchanges, etc), so it shouldn't be
>   touching them at all.
>
> * edit pages and such have meta tags telling robots "noindex,nofollow";
>   i.e., if they do end up on such a page, they shouldn't index it and
>   shouldn't follow links from it, but should just toss the page out and
>   go back the way they came.
>
> A few have somehow gotten through. I'm not sure how. They may be old and
> not yet flushed (googlebot is still going over the site and hasn't
> reindexed every page yet). Note that in the google results there's no
> summary extract, no cache, no notice of the size. It's just a raw URL
> sitting there in the results. That's weird and wrong, and to me
> indicates a problem in their index.
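
For reference, the two exclusions presumably look something like the
following (a sketch of the standard robots.txt and meta-tag syntax; the
exact rules on the live site may differ):

    # robots.txt - keeps compliant spiders out of /w/ entirely
    User-agent: *
    Disallow: /w/

    <!-- in the <head> of edit pages and other script output -->
    <meta name="robots" content="noindex,nofollow">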

This is more than a few - do a search for "site:wikipedia.org action=edit".
That gives an estimated 148,000 hits, of which only the first two seem NOT
to be edit pages. It seems to me that Google works according to one of the
following two procedures:
* When a URL is forbidden by robots.txt, they keep the URL in the
  database but do not fetch it. Thus, the page will be in the database
  with only its URL, without any title or content.
* When a new URL is found, it is added to the database in the manner described
  above. Only at a later stage is it found to be forbidden by robots.txt and
  thrown out again - only to be put back in when a link to it is found once more.
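
Either way, blocked URLs would end up in the results as bare URLs with no
title, extract or cache - exactly what we see. As a rough sketch of the
first procedure (hypothetical code, not anything Google has published; the
fetch_and_parse() helper is made up for illustration):

    from urllib.parse import urlparse

    # Hypothetical disallow rules, as if parsed from the site's robots.txt.
    DISALLOWED_PREFIXES = {"wikipedia.org": ["/w/"]}

    index = {}  # url -> metadata dict; {} means a URL-only entry

    def is_allowed(url):
        parsed = urlparse(url)
        prefixes = DISALLOWED_PREFIXES.get(parsed.hostname, [])
        return not any(parsed.path.startswith(p) for p in prefixes)

    def discover(url):
        # Every discovered URL is recorded in the index immediately...
        entry = index.setdefault(url, {})
        # ...but disallowed URLs are never fetched, so they never gain a
        # title, summary extract or cached copy - they stay bare URLs.
        if is_allowed(url):
            entry.update(fetch_and_parse(url))  # hypothetical fetcher
        return entry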

Andre Engels



