Hi All,
Wikipedia's robots.txt file (http://www.wikipedia.org/robots.txt) excludes robots from action pages (edit, history, etc.) with this:
User-agent: *
Disallow: /w/
But in the interest of short URLs, I serve my MediaWiki directly from the site root (/) without any /wiki/ or /w/ directories, so the above method would not work on my installation.
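For context, a root-level short-URL setup along these lines is typically done with Apache mod_rewrite rules roughly like the sketch below; the exact rules, and the LocalSettings.php values mentioned in the comments, vary by installation and are shown only as an illustration, not necessarily the configuration used here.

    # .htaccess sketch for serving a MediaWiki from the document root
    # (usually paired with $wgScriptPath = ""; and $wgArticlePath = "/$1";)
    RewriteEngine On
    # leave real files and directories (index.php, skins/, images/) untouched
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    # map /Page_Title onto the real script
    RewriteRule ^(.*)$ index.php?title=$1 [L,QSA]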
Any ideas how I can exclude robots from crawling all my wiki's edit, history, talk, etc, pages *without* excluding its article pages?
Thanks,
Roger Chrisman http://Wikigogy.org (MediaWiki 1.6.7)
On 9/25/06, Roger Chrisman roger@rogerchrisman.com wrote:
But in the interest of short URLs, I serve my MediaWiki directly from the site root (/) without any /wiki/ or /w/ directories, so the above method would not work on my installation.
Any ideas how I can exclude robots from crawling all my wiki's edit, history, talk, etc, pages *without* excluding its article pages?
I do the same thing, and I never did figure out the rules to disallow the other sub-pages.
As I understand it, there are "nofollow" tags within the web pages themselves, but I'm not certain that's being honoured.
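For reference, the tags referred to above are robots meta tags that MediaWiki places in the head of its action pages (edit, history, and so on); the exact output depends on the version and settings, but it looks roughly like this:

    <!-- illustrative: what MediaWiki emits on edit/history pages -->
    <meta name="robots" content="noindex,nofollow" />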
Hello,
Excluding index.php using robots.txt should work if an article link on your page is http://mysite.tld/My_Page. The robots would then not crawl http://mysite.tld/index.php?title=My_Page&action=edit, etc.
I hope that this helps, Kasimir
Kasimir Gabert wrote:
Hello,
Excluding index.php using robots.txt should work if an article link on your page is http://mysite.tld/My_Page. The robots would then not crawl http://mysite.tld/index.php?title=My_Page&action=edit, etc.
Kasimir, I believe you have written above a beautiful solution for my need. The article links on my site (http://wikigogy.org) are indeed made without reference to index.php, while the 'edit', 'history' and other action pages that I wish to exclude do go through it. I had not realized this simple, elegant solution; I will try it. It should look like this in my-wiki/robots.txt, right?
User-agent: *
Disallow: index.php*
Is the asterisk on index.php* correct and needed?
Thank you, Roger
Roger Chrisman wrote:
User-agent: *
Disallow: index.php*
Is the asterisk on index.php* correct and needed?
I think I should NOT have the asterisk in the URL prefix. I think the asterisk is only for the User-agent line, where it means all robots. It should look like this in my-site/robots.txt:
User-agent: *
Disallow: index.php
and it will disallow robots from everything that is, or starts with, "index.php". All the action-page URLs on my site do start that way, but article URLs do not, because I am using pretty URLs.
I read up on robots.txt here:
* http://www.robotstxt.org/wc/norobots.html#format
* http://www.robotstxt.org/wc/exclusion-admin.html
Thanks :-) Roger
Roger Chrisman wrote:
User-agent: *
Disallow: index.php
User-agent: *
Disallow: /index.php
I added a leading slash character per the advice of the following fabulous robots.txt validator:
http://tool.motoricerca.info/robots-checker.phtml
Thanks, Roger
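Putting the thread's conclusion together, the resulting robots.txt at the root of a pretty-URL wiki would look roughly like this; the example URLs in the comments are illustrative only:

    # robots.txt at the site root
    User-agent: *
    Disallow: /index.php
    # blocked (prefix match): /index.php?title=My_Page&action=edit
    #                         /index.php?title=My_Page&action=history
    # still crawlable:        /My_Page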