Any way of keeping these from wasting bandwidth,

GET /index.php?title=Special:Search&ns0=1&redirs=0&searchx=1&search=487.3000 (Baiduspider)
GET /index.php?title=Category:130.6000&action=edit (Googlebot/2.1)

_without_ worrying about making pretty URLs (so one can use "index.php" in robots.txt)?
Why can't a nofollow be added to such links in the next version?
On 17/08/07, jidanni@jidanni.org <jidanni@jidanni.org> wrote:
Any way of keeping these from wasting bandwidth,

GET /index.php?title=Special:Search&ns0=1&redirs=0&searchx=1&search=487.3000 (Baiduspider)
GET /index.php?title=Category:130.6000&action=edit (Googlebot/2.1)

_without_ worrying about making pretty URLs (so one can use "index.php" in robots.txt)?
Why can't a nofollow be added to such links in the next version?
No doubt we can add rel="nofollow" to broken links, but be aware that robots are not required to adhere to this; the actual meaning is not "don't follow this link", but rather, "don't afford the page this links to any significance". [see http://microformats.org/wiki/rel-nofollow]
At present, the only way to instruct a robots exclusion standard compliant (robots.txt-compliant) robot not to follow links on a page is via an appropriate line in said robots.txt file, or using the <meta name="robots"> tag, setting the "content" attribute to contain "nofollow". And of course, it's not possible to *enforce* that this is followed without resorting to crude access control.
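For reference, the meta tag form described above would look like this in the page head (MediaWiki emits an equivalent tag on edit pages):

```html
<!-- In the document <head>: asks compliant robots not to index
     this page and not to follow any links on it -->
<meta name="robots" content="noindex,nofollow">
```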
Rob Church
On 8/17/07, jidanni@jidanni.org <jidanni@jidanni.org> wrote:
Any way of keeping these from wasting bandwidth,

GET /index.php?title=Special:Search&ns0=1&redirs=0&searchx=1&search=487.3000 (Baiduspider)
GET /index.php?title=Category:130.6000&action=edit (Googlebot/2.1)

_without_ worrying about making pretty URLs (so one can use "index.php" in robots.txt)?
You could blacklist anything with an "&" in it to achieve similar effect, I suppose.
S> You could blacklist anything with an "&" in it to achieve similar
S> effect, I suppose.
OK, now in http://radioscanningtw.jidanni.org/robots.txt I'm trying the common extended protocol "Disallow: /*&", but I see even the standard "Disallow: /index.php?title=Special:" is ignored here:
"GET /robots.txt HTTP/1.0" 200 737 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
"GET /index.php?title=Special:Recentchangeslinked/Category:%E6%A5%AD%E8%80%85 HTTP/1.0" 200 9921 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
On 8/18/07, jidanni@jidanni.org <jidanni@jidanni.org> wrote:
OK, now in http://radioscanningtw.jidanni.org/robots.txt I'm trying the common extended protocol "Disallow: /*&"
It seems only fully-specified prefixes with no wildcards are permitted in robots.txt:
"The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html." http://www.robotstxt.org/wc/norobots.html
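The prefix behaviour the spec describes can be checked with Python's standard-library robots.txt parser (a quick sketch; the hostname is arbitrary):

```python
from urllib.robotparser import RobotFileParser

# Parse a two-line robots.txt with a plain-prefix Disallow rule.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /help",
])

# "Disallow: /help" is a pure prefix match: both URLs below are blocked.
print(rp.can_fetch("*", "http://example.org/help.html"))        # False
print(rp.can_fetch("*", "http://example.org/help/index.html"))  # False
# A path that doesn't start with the prefix stays allowed.
print(rp.can_fetch("*", "http://example.org/other.html"))       # True
```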
So there's no way to do any of this without prettified URLs, I don't think, short of listing every possible page. Or you could hack up the code to make URLs look like ?action=edit&title=Foo instead of the reverse. In light of this, it does seem like it would be a good idea to use rel="nofollow" on links to things like edit pages. Someone just needs to code it.
but I see even the standard "Disallow: /index.php?title=Special:" is ignored here:
Maybe you added extra line breaks? A blank line terminates a section, and any section not starting with a User-Agent line will be ignored.
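For comparison, a minimal well-formed file: the User-agent line and its Disallow rules must form one contiguous block, with no blank line in between.

```
User-agent: *
Disallow: /index.php?title=Special:
```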
On 8/19/07, Simetrical <Simetrical+wikilist@gmail.com> wrote:
So there's no way to do any of this without prettified URLs, I don't think, short of listing every possible page. Or you could hack up the code to make URLs look like ?action=edit&title=Foo instead of the reverse. In light of this, it does seem like it would be a good idea to use rel="nofollow" on links to things like edit pages. Someone just needs to code it.
Am I missing something, or could we just do this?
"Disallow: /w/"
On 8/18/07, Stephen Bain <stephen.bain@gmail.com> wrote:
Am I missing something, or could we just do this?
"Disallow: /w/"
That only works if you're using URL rewriting.
On 8/19/07, Simetrical <Simetrical+wikilist@gmail.com> wrote:
On 8/18/07, Stephen Bain <stephen.bain@gmail.com> wrote:
Am I missing something, or could we just do this?
"Disallow: /w/"
That only works if you're using URL rewriting.
Oh, are we talking about the default configuration then? I thought we were talking about the configurations for the Wikimedia sites.
On 8/19/07, Stephen Bain <stephen.bain@gmail.com> wrote:
Oh, are we talking about the default configuration then? I thought we were talking about the configurations for the Wikimedia sites.
Wikimedia sites are fine. They use Disallow: /w/ or some equivalent.
Maybe you added extra line breaks?
I don't see any in http://radioscanningtw.jidanni.org/robots.txt
Some crawlers adhere to their published syntax extensions, e.g., http://www.google.com/bot.html .
The documentation URLs some other crawlers advertise in their User-Agent strings mention extensions, but those crawlers don't seem to follow even the vanilla robots.txt standard.
Still others advertise no extensions, saying they just use the vanilla robots.txt standard, but then some don't adhere even to what they just said.
On 19/08/07, Simetrical <Simetrical+wikilist@gmail.com> wrote:
reverse. In light of this, it does seem like it would be a good idea to use rel="nofollow" on links to things like edit pages. Someone just needs to code it.
As I stated in a previous post,
* rel="nofollow" *does not mean* "do not follow this link" - this is a known issue with the naming of the attribute value
* User agents are free to ignore it
* We offer up edit pages with "noindex,nofollow"
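As markup, the attribute in question is a per-link annotation (a hypothetical edit link for illustration):

```html
<!-- "nofollow" here means "do not afford the target any significance",
     not "do not fetch this URL" -->
<a href="/index.php?title=Sandbox&amp;action=edit" rel="nofollow">edit</a>
```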
The context of the original post was in reducing "wasted" bandwidth, which I have to point out is not going to be possible - if a URL exists on the web, then it is liable to be accessed in some form.
Robots can and will ignore the so-called standards, and thus "waste bandwidth". Adding rel="nofollow" to edit links won't affect those robots which don't adhere to it, nor those which honour its precise semantic meaning (they will still follow the link, just without assigning significance to the target page).
It's therefore going to be of negligible benefit to do this, trivial though it is; if a robot ignores the attribute, then it'll follow the link, and if it ignores the "noindex,nofollow" meta tag, it'll index it anyway.
Rob Church
On 8/19/07, Rob Church <robchur@gmail.com> wrote:
- rel="nofollow" *does not mean* "do not follow this link" - this is a
known issue with the naming of the attribute value
- User agents are free to ignore it
The question is, do Google, Yahoo!, and/or Microsoft follow rel="nofollow" links? Those are the three major spiders in my experience, and stopping them from wasting their time would certainly save a lot of page views.
I wouldn't underestimate the effect that can have on a server, either. At times on my site, the majority of users are spiders. Randomly checking now, I find my bulletin board's session tracker registers 89 bots active in the last 15 minutes, out of 537 users. I've occasionally had several hundred bots active -- the record my board keeps for "most users ever online", 1242, was mainly bots. Not that those figures are scientific (they record logged-in users plus IP addresses with at least one activity in the past 15 minutes, not total number of activities in the past 15 minutes), but I have no doubt that it could make a significant performance difference to be rid of unwanted bots, and result in faster indexing because they don't have to toss out the contents of the pages they request.
On 8/19/07, Simetrical <Simetrical+wikilist@gmail.com> wrote:
The question is, do Google, Yahoo!, and/or Microsoft follow rel="nofollow" links?
The answer to this, from a quick Google, appears to be no, yes, maybe respectively. So it will help a bit, but not as much as it might.
The question is, do Google, Yahoo!, and/or Microsoft follow rel="nofollow" links?
The answer to this, from a quick Google, appears to be no, yes, maybe respectively. So it will help a bit, but not as much as it might.
Is it just me that finds it ironic that we're solving problems with search engines by Googling them?
Nope. It's less ironic than googling "google" to find the main site (which a lot of my less-tech-savvy friends do), and search engines are the best source of info on these topics.
-Matt
From: "Thomas Dalton" <thomas.dalton@gmail.com>
To: "Wikimedia developers" <wikitech-l@lists.wikimedia.org>
Subject: Re: [Wikitech-l] not enough nofollow
Date: Sun, 19 Aug 2007 20:59:53 +0100
The question is, do Google, Yahoo!, and/or Microsoft follow rel="nofollow" links?
The answer to this, from a quick Google, appears to be no, yes, maybe respectively. So it will help a bit, but not as much as it might.
Is it just me that finds it ironic that we're solving problems with search engines by Googling them?
On 8/18/07, Simetrical <Simetrical+wikilist@gmail.com> wrote:
So there's no way to do any of this without prettified URLs, I don't think, short of listing every possible page. Or you could hack up the code to make URLs look like ?action=edit&title=Foo instead of the reverse.
Way back in the day I separated out index.php and edit.php for exactly this reason. I later changed it to http://mydomain/edit/Title when I went to pretty URLs, but the original poster asked what to use "_without_ worrying about making pretty URLs". Copying index.php to edit.php and hacking up the code to change the URL would work. I don't know whether your solution would work or not (I don't know if the robots.txt spec allows you to distinguish based on GET parameters if they appear at the beginning).
Anthony wrote:
On 8/18/07, Simetrical <Simetrical+wikilist@gmail.com> wrote:
So there's no way to do any of this without prettified URLs, I don't think, short of listing every possible page. Or you could hack up the code to make URLs look like ?action=edit&title=Foo instead of the reverse.
Way back in the day I separated out index.php and edit.php for exactly this reason. I later changed it to http://mydomain/edit/Title when I went to pretty URLs, but the original poster asked what to use "_without_ worrying about making pretty URLs". Copying index.php to edit.php and hacking up the code to change the URL would work. I don't know whether your solution would work or not
Something like this should do:
$wgActionPaths['edit'] = "/edit.php?title=$1";
where edit.php consists of:
<?php $_REQUEST['action'] = 'edit'; require './index.php'; ?>
(don't know if the robots.txt spec allows you to distinguish based on get parameters if they appear at the beginning).
robots.txt spec is pretty primitive and only allows for specifying complete prefixes.
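With the edit.php split above, the URLs to be excluded do share a complete prefix, so a plain rule suffices (a sketch, assuming edit.php sits at the site root):

```
User-agent: *
Disallow: /edit.php
```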
-- brion vibber (brion @ wikimedia.org)
On 8/20/07, Brion Vibber <brion@wikimedia.org> wrote:
Something like this should do:
$wgActionPaths['edit'] = "/edit.php?title=$1";
where edit.php consists of:
<?php $_REQUEST['action'] = 'edit'; require './index.php'; ?>
Wow, why do I totally not know about all these cool customization options. :(