It is worth noting that I've been having some similar problems with googlebot on a different system (CivicSpace). It appeared as if Google was disregarding the robots.txt file.
It appears as if they don't reload the robots.txt file with any sort of regularity, but you can request them to do so.
Here is the message I got:
--------------------------------------------------------
Once you have added the appropriate robots.txt file entries or meta tags, you'll need to process your removal request through our public removal tool. You can access this tool at http://services.google.com:8882/urlconsole/controller?cmd=reload&lastcmd...
For more information please visit http://www.google.com/remove.html
--------------------------------------------------------
When I went to their service, I was told that even after submitting the request, it may take up to 24 hours before it is processed.
Aldon
-----Original Message-----
Date: Tue, 16 Aug 2005 17:18:38 +0200
From: Thomas Koll tomk32@gmx.de
Subject: Re: [Mediawiki-l] googlebot
To: MediaWiki announcements and site admin list mediawiki-l@Wikimedia.org
Message-ID: C08D472C-A409-4D08-89C2-57AD8E347E12@gmx.de
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
On 16 Aug 2005, at 17:14, andres wrote:
I happened to be watching my logfiles (tail -f) and noticed some weird behaviour from the Google spider: it spends literally days following all the links in Spezial:Recentchanges.
It may be an Apache configuration mistake (how?), but it may also be a MediaWiki problem.
How can I disallow search engines from indexing all the recent changes? They are worthless to index anyway.
you can use http://en.wikipedia.org/robots.txt
ciao, tom
OK, robots.txt seems to be worth using. At the moment Google is searching endlessly in /mediawiki/index.php?title=Spezial:Recentchanges&from=2005... What is the rule to avoid this type of request? It is not in my interest to disallow all of index.php, only ?title=Spezial:Recentchanges. I didn't find any example at www.robotstxt.org.
Andres Obrero
MediaWiki-l mailing list MediaWiki-l@Wikimedia.org http://mail.wikipedia.org/mailman/listinfo/mediawiki-l
On 17/08/05, Andres Obrero andres@holzapfel.ch wrote:
OK, robots.txt seems to be worth using. At the moment Google is searching endlessly in /mediawiki/index.php?title=Spezial:Recentchanges&from=2005... What is the rule to avoid this type of request? It is not in my interest to disallow all of index.php, only ?title=Spezial:Recentchanges
Well, as Christof says, you could use the following to tell bots not to look at RC at all:

User-Agent: *
Disallow: /wiki/Special:Recentchanges
Disallow: /wiki/Special%3ARecentchanges
Or you could take the same approach as Wikimedia, and only let spiders access pages with no extra parameters - assuming you have a rewrite rule to serve "plain" pages as "/wiki/page" or some such; see http://mail.wikipedia.org/pipermail/wikitech-l/2005-August/031032.html
The advantage is that you may actually want spiders to follow your recent changes, because it helps them spot what needs indexing.
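For illustration, the Wikimedia-style robots.txt might look like this, assuming articles are rewritten to a /wiki/ prefix and the scripts live under /w/ (both paths are assumptions - adjust them to your own rewrite setup):

```
# Block the script path (index.php plus query strings);
# plain /wiki/ article URLs stay crawlable
User-agent: *
Disallow: /w/
```

Because every URL with extra parameters goes through /w/index.php, this one rule keeps spiders out of Recentchanges, histories and diffs while still letting them read the articles themselves.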
On 8/17/05, Andres Obrero andres@holzapfel.ch wrote:
OK, robots.txt seems to be worth using. At the moment Google is searching endlessly in /mediawiki/index.php?title=Spezial:Recentchanges&from=2005... What is the rule to avoid this type of request? It is not in my interest to disallow all of index.php, only ?title=Spezial:Recentchanges. I didn't find any example at www.robotstxt.org.
Andres Obrero
I don't think this is possible using ONLY robots.txt.
The way to do this is to set up LocalSettings.php to use a different path prefix for articles vs. other things in the wiki, and use Apache mod_rewrite to separate the paths. MediaWiki has the concept of an article path and a script path. The normal thing is to have the article path be /wiki/ and the script path be /w/.
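As a sketch, the relevant LocalSettings.php settings for that layout might look like this (the /w/ and /wiki/ values are assumptions matching the conventional setup described above; adjust for your server):

```php
$wgScriptPath  = "/w";                       # where index.php really lives
$wgScript      = "$wgScriptPath/index.php";
$wgArticlePath = "/wiki/$1";                 # "pretty" prefix used for article links
```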
for more see: http://meta.wikimedia.org/wiki/Robots.txt and http://meta.wikimedia.org/wiki/Rewrite_Rules
This requires that you have access to the server to set up the rewrite rules, and be warned that using mod_rewrite is one of the more complex tasks in configuring MediaWiki.
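As a rough sketch of the Apache side (illustrative only, not a tested drop-in config), the rewrite rule that maps the article path onto the script path could look like:

```apache
# Map /wiki/Some_Page onto the real script.
# [L] stops further rewriting; [QSA] preserves any existing query string.
RewriteEngine On
RewriteRule ^/?wiki/(.*)$ /w/index.php?title=$1 [L,QSA]
```

The pages on meta linked above cover the corner cases (encoded characters, the root /wiki URL, and keeping /w/ itself reachable) that a one-liner like this glosses over.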