[Labs-l] Tools: New default "disallow" for robots.txt

Tim Landscheidt tim at tim-landscheidt.de
Thu Aug 15 02:39:33 UTC 2013


I wrote:

> [...]

> Regarding robots.txt, I've started
> https://gerrit.wikimedia.org/r/77916.  Toolserver's
> robots.txt is:

> | User-agent: msnbot
> | Disallow: /

> | User-agent: *
> | Disallow: /~magnus/geo/geohack.php
> | Disallow: /~daniel/WikiSense
> | Disallow: /~geohack/
> | Disallow: /~enwp10/
> | Disallow: /~cbm/cgi-bin/

> (WikiSense is CatScan IIRC.)  Excluding Geohack is probably
> a good idea.  Do other tool authors have tools they do not
> want to be crawled by search engine bots?

There was (and still is) a *lot* of crawler traffic to
various tools that are linked from every article in
Wikipedia and that perform expensive calculations on every
request.  To cope with this, I have changed robots.txt to:

| User-agent: *
| Disallow: /

for the moment, i.e. disallow *all* crawler access
*anywhere*.
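
For anyone who wants to sanity-check how a spec-following
parser reads those two lines, Python 3's standard-library
robotparser gives the expected answer (the URL is only an
illustration):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    # Feed the two-line policy directly instead of fetching it.
    rp.parse(["User-agent: *", "Disallow: /"])
    # Every path is now off limits to every user agent.
    print(rp.can_fetch("Googlebot",
                       "https://tools.wmflabs.org/sometool/"))  # False
    print(rp.can_fetch("*", "https://tools.wmflabs.org/"))      # False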

On reflection, this is probably a better default :-).  So:
if your tool *needs* to be crawled by search engine bots, or
if the blanket disallow causes other problems for you,
please speak up.

Also, we'd obviously want the central homepage and the list
of tools to be indexed, so this won't be the final revision
of robots.txt.
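
One possible shape for that revision, using the "Allow"
directive and the "$" end-of-URL anchor that the big
crawlers (Googlebot, Bingbot) support as extensions to the
original robots.txt standard; "/list-of-tools" is only a
placeholder, not a real path:

| User-agent: *
| Allow: /$
| Allow: /list-of-tools
| Disallow: /

Parsers that evaluate rules in order need the Allow lines
first; Googlebot picks the most specific match regardless
of order.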

Tim



