Hi All,
Wikipedia's robots.txt file (http://www.wikipedia.org/robots.txt) excludes robots from action pages (edit, history, etc.) with this:
User-agent: *
Disallow: /w/
But in the interest of short URLs, I serve my MediaWiki directly from the site root (/) without any /wiki/ or /w/ directories, so the above method would not work on my installation.
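For context, a root-level short-URL setup along these lines is typically done with Apache mod_rewrite rules roughly like the sketch below; the exact rules, and the LocalSettings.php values mentioned in the comments, vary by installation and are shown only as an illustration, not necessarily the configuration used here.

    # .htaccess sketch for serving a MediaWiki from the document root
    # (usually paired with $wgScriptPath = ""; and $wgArticlePath = "/$1";)
    RewriteEngine On
    # leave real files and directories (index.php, skins/, images/) untouched
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    # map /Page_Title onto the real script
    RewriteRule ^(.*)$ index.php?title=$1 [L,QSA]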
Any ideas how I can exclude robots from crawling all my wiki's edit, history, talk, etc, pages *without* excluding its article pages?
Thanks,
Roger Chrisman http://Wikigogy.org (MediaWiki 1.6.7)
On 9/25/06, Roger Chrisman roger@rogerchrisman.com wrote:
But in the interest of short URLs, I serve my MediaWiki directly from the site root (/) without any /wiki/ or /w/ directories, so the above method would not work on my installation.
Any ideas how I can exclude robots from crawling all my wiki's edit, history, talk, etc, pages *without* excluding its article pages?
I do the same thing, and I never did figure out the rules to disallow the other sub-pages.
As I understand it, there are "nofollow" tags within the web pages themselves, but I'm not certain that's being honoured.
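For reference, the tags referred to above are robots meta tags that MediaWiki places in the head of its action pages (edit, history, and so on); the exact output depends on the version and settings, but it looks roughly like this:

    <!-- illustrative: what MediaWiki emits on edit/history pages -->
    <meta name="robots" content="noindex,nofollow" />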
Hello,
Excluding index.php using robots.txt should work if an article link on your page is http://mysite.tld/My_Page. The robots would then not crawl http://mysite.tld/index.php?title=My_Page&action=edit, etc.
I hope that this helps, Kasimir
Kasimir Gabert wrote:
Hello,
Excluding index.php using robots.txt should work if an article link on your page is http://mysite.tld/My_Page. The robots would then not crawl http://mysite.tld/index.php?title=My_Page&action=edit, etc.
Kasimir, I believe you have written above a beautiful solution for my need. The article links on my site (http://wikigogy.org) are indeed made without reference to index.php, while the 'edit', 'history' and other action pages that I wish to exclude do go through it. I had not realized this simple, elegant solution; I will try it. It should look like this in my-wiki/robots.txt, right?
User-agent: *
Disallow: index.php*
Is the asterisk on index.php* correct and needed?
Thank you, Roger
Roger Chrisman wrote:
User-agent: *
Disallow: index.php*
Is the asterisk on index.php* correct and needed?
I think I should NOT have the asterisk in the URL prefix. I think the asterisk is only for the User-agent line, where it means all robots. It should look like this in my-site/robots.txt:
User-agent: *
Disallow: index.php
and it will disallow robots from everything that is, or starts with, "index.php". All the action-page URLs on my site do start that way, but article URLs do not, because I am using pretty URLs.
I read up on robots.txt here:
* http://www.robotstxt.org/wc/norobots.html#format
* http://www.robotstxt.org/wc/exclusion-admin.html
Thanks :-) Roger
Roger Chrisman wrote:
User-agent: *
Disallow: index.php
User-agent: *
Disallow: /index.php
I added a leading slash character per the advice of the following fabulous robots.txt validator:
http://tool.motoricerca.info/robots-checker.phtml
Thanks, Roger
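Putting the thread's conclusion together, the resulting robots.txt at the root of a pretty-URL wiki would look roughly like this; the example URLs in the comments are illustrative only:

    # robots.txt at the site root
    User-agent: *
    Disallow: /index.php
    # blocked (prefix match): /index.php?title=My_Page&action=edit
    #                         /index.php?title=My_Page&action=history
    # still crawlable:        /My_Page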