Any way of keeping these from wasting bandwidth,

GET /index.php?title=Special:Search&ns0=1&redirs=0&searchx=1&search=487.3000 (Baiduspider)
GET /index.php?title=Category:130.6000&action=edit (Googlebot/2.1)

_without_ worrying about making pretty URLs (so one can use "index.php" in robots.txt)?
Why can't a nofollow be added to such links in the next version?
On 17/08/07, jidanni@jidanni.org <jidanni@jidanni.org> wrote:
Any way of keeping these from wasting bandwidth,

GET /index.php?title=Special:Search&ns0=1&redirs=0&searchx=1&search=487.3000 (Baiduspider)
GET /index.php?title=Category:130.6000&action=edit (Googlebot/2.1)

_without_ worrying about making pretty URLs (so one can use "index.php" in robots.txt)?
Why can't a nofollow be added to such links in the next version?
No doubt we can add rel="nofollow" to broken links, but be aware that robots are not required to adhere to this; the actual meaning is not "don't follow this link", but rather, "don't afford the page this links to any significance". [see http://microformats.org/wiki/rel-nofollow]
At present, the only way to instruct a robots exclusion standard compliant (robots.txt-compliant) robot not to follow links on a page is via an appropriate line in said robots.txt file, or using the <meta name="robots"> tag, setting the "content" attribute to contain "nofollow". And of course, it's not possible to *enforce* that this is followed without resorting to crude access control.
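For reference, the meta tag form described above would look like this in the page head (MediaWiki emits an equivalent tag on edit pages):

```html
<!-- In the document <head>: asks compliant robots not to index
     this page and not to follow any links on it -->
<meta name="robots" content="noindex,nofollow">
```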
Rob Church
On 8/17/07, jidanni@jidanni.org <jidanni@jidanni.org> wrote:
Any way of keeping these from wasting bandwidth,

GET /index.php?title=Special:Search&ns0=1&redirs=0&searchx=1&search=487.3000 (Baiduspider)
GET /index.php?title=Category:130.6000&action=edit (Googlebot/2.1)

_without_ worrying about making pretty URLs (so one can use "index.php" in robots.txt)?
You could blacklist anything with an "&" in it to achieve similar effect, I suppose.
S> You could blacklist anything with an "&" in it to achieve similar
S> effect, I suppose.
OK, now in http://radioscanningtw.jidanni.org/robots.txt I'm trying the common extended protocol "Disallow: /*&", but I see even the standard "Disallow: /index.php?title=Special:" is ignored here:
"GET /robots.txt HTTP/1.0" 200 737 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
"GET /index.php?title=Special:Recentchangeslinked/Category:%E6%A5%AD%E8%80%85 HTTP/1.0" 200 9921 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
On 8/18/07, jidanni@jidanni.org <jidanni@jidanni.org> wrote:
OK, now in http://radioscanningtw.jidanni.org/robots.txt I'm trying the common extended protocol "Disallow: /*&"
It seems only fully-specified prefixes with no wildcards are permitted in robots.txt:
"The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html." http://www.robotstxt.org/wc/norobots.html
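The prefix behaviour the spec describes can be checked with Python's standard-library robots.txt parser (a quick sketch; the hostname is arbitrary):

```python
from urllib.robotparser import RobotFileParser

# Parse a two-line robots.txt with a plain-prefix Disallow rule.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /help",
])

# "Disallow: /help" is a pure prefix match: both URLs below are blocked.
print(rp.can_fetch("*", "http://example.org/help.html"))        # False
print(rp.can_fetch("*", "http://example.org/help/index.html"))  # False
# A path that doesn't start with the prefix stays allowed.
print(rp.can_fetch("*", "http://example.org/other.html"))       # True
```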
So there's no way to do any of this without prettified URLs, I don't think, short of listing every possible page. Or you could hack up the code to make URLs look like ?action=edit&title=Foo instead of the reverse. In light of this, it does seem like it would be a good idea to use rel="nofollow" on links to things like edit pages. Someone just needs to code it.
but I see even the standard "Disallow: /index.php?title=Special:" is ignored here:
Maybe you added extra line breaks? A blank line terminates a section, and any section not starting with a User-Agent line will be ignored.
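For comparison, a minimal well-formed file: the User-agent line and its Disallow rules must form one contiguous block, with no blank line in between.

```
User-agent: *
Disallow: /index.php?title=Special:
```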
On 8/19/07, Simetrical <Simetrical+wikilist@gmail.com> wrote:
So there's no way to do any of this without prettified URLs, I don't think, short of listing every possible page. Or you could hack up the code to make URLs look like ?action=edit&title=Foo instead of the reverse. In light of this, it does seem like it would be a good idea to use rel="nofollow" on links to things like edit pages. Someone just needs to code it.
Am I missing something, or could we just do this?
"Disallow: /w/"
On 8/18/07, Stephen Bain <stephen.bain@gmail.com> wrote:
Am I missing something, or could we just do this?
"Disallow: /w/"
That only works if you're using URL rewriting.
On 8/19/07, Simetrical <Simetrical+wikilist@gmail.com> wrote:
On 8/18/07, Stephen Bain <stephen.bain@gmail.com> wrote:
Am I missing something, or could we just do this?
"Disallow: /w/"
That only works if you're using URL rewriting.
Oh, are we talking about the default configuration then? I thought we were talking about the configurations for the Wikimedia sites.
On 8/19/07, Stephen Bain <stephen.bain@gmail.com> wrote:
Oh, are we talking about the default configuration then? I thought we were talking about the configurations for the Wikimedia sites.
Wikimedia sites are fine. They use Disallow: /w/ or some equivalent.
Maybe you added extra line breaks?
I don't see any in http://radioscanningtw.jidanni.org/robots.txt
Some crawlers adhere to their published syntax extensions, e.g., http://www.google.com/bot.html .
The documentation URLs some other crawlers advertise in their User-Agent strings mention extensions, but those crawlers don't seem to follow even the vanilla robots.txt standard.
Still others advertise no extensions, saying they just use the vanilla robots.txt standard, but then some don't adhere even to what they just said.
On 19/08/07, Simetrical <Simetrical+wikilist@gmail.com> wrote:
reverse. In light of this, it does seem like it would be a good idea to use rel="nofollow" on links to things like edit pages. Someone just needs to code it.
As I stated in a previous post,
* rel="nofollow" *does not mean* "do not follow this link" - this is a known issue with the naming of the attribute value
* User agents are free to ignore it
* We offer up edit pages with "noindex,nofollow"
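As markup, the attribute in question is a per-link annotation (a hypothetical edit link for illustration):

```html
<!-- "nofollow" here means "do not afford the target any significance",
     not "do not fetch this URL" -->
<a href="/index.php?title=Sandbox&amp;action=edit" rel="nofollow">edit</a>
```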
The context of the original post was in reducing "wasted" bandwidth, which I have to point out is not going to be possible - if a URL exists on the web, then it is liable to be accessed in some form.
Robots can and will ignore the so-called standards, and thus "waste bandwidth". Adding rel="nofollow" to edit links won't affect those robots which don't adhere to it, nor those which honour its precise semantic meaning (they will still follow the link, just without assigning significance to the target page).
It's therefore going to be of negligible benefit to do this, trivial though it is; if a robot ignores the attribute, then it'll follow the link, and if it ignores the "noindex,nofollow" meta tag, it'll index it anyway.
Rob Church
On 8/19/07, Rob Church <robchur@gmail.com> wrote:
- rel="nofollow" *does not mean* "do not follow this link" - this is a
known issue with the naming of the attribute value
- User agents are free to ignore it
The question is, do Google, Yahoo!, and/or Microsoft follow rel="nofollow" links? Those are the three major spiders in my experience, and stopping them from wasting their time would certainly save a lot of page views.
I wouldn't underestimate the effect that can have on a server, either. At times on my site, the majority of users are spiders. Randomly checking now, I find my bulletin board's session tracker registers 89 bots active in the last 15 minutes, out of 537 users. I've occasionally had several hundred bots active -- the record my board keeps for "most users ever online", 1242, was mainly bots. Not that those figures are scientific (they record logged-in users plus IP addresses with at least one activity in the past 15 minutes, not total number of activities in the past 15 minutes), but I have no doubt that it could make a significant performance difference to be rid of unwanted bots, and result in faster indexing because they don't have to toss out the contents of the pages they request.
On 8/19/07, Simetrical <Simetrical+wikilist@gmail.com> wrote:
The question is, do Google, Yahoo!, and/or Microsoft follow rel="nofollow" links?
The answer to this, from a quick Google, appears to be no, yes, maybe respectively. So it will help a bit, but not as much as it might.
The question is, do Google, Yahoo!, and/or Microsoft follow rel="nofollow" links?
The answer to this, from a quick Google, appears to be no, yes, maybe respectively. So it will help a bit, but not as much as it might.
Is it just me that finds it ironic that we're solving problems with search engines by Googling them?
Nope. It's less ironic than googling "google" to find the main site (which a lot of my less-tech-savvy friends do), and search engines are the best source of info on these topics.
-Matt
From: "Thomas Dalton" <thomas.dalton@gmail.com>
To: "Wikimedia developers" <wikitech-l@lists.wikimedia.org>
Subject: Re: [Wikitech-l] not enough nofollow
Date: Sun, 19 Aug 2007 20:59:53 +0100
The question is, do Google, Yahoo!, and/or Microsoft follow rel="nofollow" links?
The answer to this, from a quick Google, appears to be no, yes, maybe respectively. So it will help a bit, but not as much as it might.
Is it just me that finds it ironic that we're solving problems with search engines by Googling them?
On 8/18/07, Simetrical <Simetrical+wikilist@gmail.com> wrote:
So there's no way to do any of this without prettified URLs, I don't think, short of listing every possible page. Or you could hack up the code to make URLs look like ?action=edit&title=Foo instead of the reverse.
Way back in the day I separated out index.php and edit.php for exactly this reason. I later changed it to http://mydomain/edit/Title when I went to pretty URLs, but the original poster asked what to use "_without_ worrying about making pretty URLs". Copying index.php to edit.php and hacking up the code to change the URL would work. I don't know whether your solution would work or not (I don't know if the robots.txt spec allows you to distinguish based on GET parameters if they appear at the beginning).
Anthony wrote:
On 8/18/07, Simetrical <Simetrical+wikilist@gmail.com> wrote:
So there's no way to do any of this without prettified URLs, I don't think, short of listing every possible page. Or you could hack up the code to make URLs look like ?action=edit&title=Foo instead of the reverse.
Way back in the day I separated out index.php and edit.php for exactly this reason. I later changed it to http://mydomain/edit/Title when I went to pretty URLs, but the original poster asked what to use "_without_ worrying about making pretty URLs". Copying index.php to edit.php and hacking up the code to change the URL would work. I don't know whether your solution would work or not
Something like this should do:
$wgActionPaths['edit'] = "/edit.php?title=$1";
where edit.php consists of:
<?php $_REQUEST['action'] = 'edit'; require './index.php'; ?>
(don't know if the robots.txt spec allows you to distinguish based on get parameters if they appear at the beginning).
robots.txt spec is pretty primitive and only allows for specifying complete prefixes.
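With the edit.php split above, the URLs to be excluded do share a complete prefix, so a plain rule suffices (a sketch, assuming edit.php sits at the site root):

```
User-agent: *
Disallow: /edit.php
```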
-- brion vibber (brion @ wikimedia.org)
On 8/20/07, Brion Vibber <brion@wikimedia.org> wrote:
Something like this should do:
$wgActionPaths['edit'] = "/edit.php?title=$1";
where edit.php consists of:
<?php $_REQUEST['action'] = 'edit'; require './index.php'; ?>
Wow, why do I totally not know about all these cool customization options. :(