On 20 Mar 2003, Brion Vibber wrote:
That wouldn't have gone to an edit page, just to a blank page. The problem here is that an actual edit URL got into google at some point and is still coming up in results.
Sannse, we *do* exclude edit pages from google's and other bots' spiders, doubly:
robots.txt excludes access to the /w/ subdirectory, and thus all direct script actions (edits, histories, diffs, printable mode, changing options/length on recentchanges, etc.), so a well-behaved bot shouldn't be touching them at all.
edit pages and such have meta tags telling robots "noindex,nofollow"; i.e. if they do end up with such a page, they shouldn't index it and shouldn't follow links from it, but should just toss the page out and go back where they came from. (Both exclusions are sketched below.)
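For reference, the two mechanisms look roughly like this - a sketch assuming the standard robots.txt and meta-tag syntax, not a verbatim quote of the live wikipedia.org configuration:

    # robots.txt at the site root: keep spiders out of the script directory
    User-agent: *
    Disallow: /w/

    <!-- emitted in the <head> of edit pages and other script output -->
    <meta name="robots" content="noindex,nofollow">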
A few have somehow gotten through. I'm not sure how. They may be old and not yet flushed (googlebot is still going over the site and hasn't reindexed every page yet). Note that in the google results there's no summary extract, no cache, no notice of the size. It's just a raw URL sitting there in the results. That's weird and wrong, and to me indicates a problem in their index.
This is more than a few - do a search for "site:wikipedia.org action=edit". That gives an estimated 148,000 hits; only the first two seem to be NOT edit pages. It seems to me that google works according to one of the following two procedures:

* When a URL is forbidden by robots.txt, they do keep the URL in the database, but do not follow it. Thus, the page will be in the database, but only with its URL, without any title or content.
* When a new URL is found, it is added to the database in the manner described above. Only at a later stage is it found to be forbidden by robots.txt and thrown out again - to then be moved back in when a link to it is found.
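One way to sanity-check the premise behind both procedures - that the edit URLs really are forbidden, so they can only sit in the index as bare URLs - is to ask a robots.txt parser directly. A minimal sketch in Python, assuming the standard-library urllib.robotparser and a hypothetical edit URL under /w/ (the exact URL form is an assumption, not taken from the site):

    import urllib.robotparser

    # Load and parse the site's robots.txt (assumes it is served at the usual location).
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.wikipedia.org/robots.txt")
    rp.read()

    # An ordinary article URL versus a direct script action under /w/.
    article_url = "http://www.wikipedia.org/wiki/Main_Page"
    edit_url = "http://www.wikipedia.org/w/wiki.phtml?title=Main_Page&action=edit"

    print(rp.can_fetch("*", article_url))  # expected: True
    print(rp.can_fetch("*", edit_url))     # expected: False, if /w/ is disallowed

If the second check comes back False, then any such URL showing up in google's results can only be there in exactly the state described above: a bare, never-fetched entry with no title, extract, or cache.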
Andre Engels