On April 14, Evan Martin wrote:
To answer your specific proposal:
- http://en.wikipedia.org/wiki/Special:Recentchanges has a meta tag:
<meta name="robots" content="noindex,follow" /> which indicates it's explicitly disallowed from being crawled.
As far as I understand the robots meta tag, "noindex,follow" tells robots that they are welcome to fetch the page, that they can find links to other pages here (= follow), but they should never show this page among the search hits (= noindex).
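To make this reading of the tag concrete, here is a minimal Python sketch of how a robot might turn the meta tag into two separate permissions. The names RobotsMetaParser and robots_policy are made up for illustration, and the defaults (index and follow both allowed when no tag is present) are an assumption about conventional robot behaviour, not a description of any particular search engine.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return the two permissions as booleans (hypothetical helper).

    index  = may the page itself be shown among the search hits?
    follow = may the links on the page be used to find other pages?
    Assumption: both default to True when no directive says otherwise.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    d = parser.directives
    return {
        "index": "noindex" not in d and "none" not in d,
        "follow": "nofollow" not in d and "none" not in d,
    }


if __name__ == "__main__":
    tag = '<meta name="robots" content="noindex,follow" />'
    print(robots_policy(tag))
```

Note that the robot only learns the policy after it has fetched and parsed the page, so neither directive can prevent the fetch itself.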
Words such as crawl and index are somewhat fuzzy here. Does "index" mean fetch or does it mean store in an index, to be returned to users as a search hit? I found no clear answer. Of course, the crawler/robot/spider is already fetching the page when it sees the meta tag. And it must fetch the page again to see if the meta tag has changed.
The Pipermail software used for the wikitech-l archive sets "noindex,follow" on the overview sorted by date, e.g. http://mail.wikimedia.org/pipermail/wikitech-l/2006-April/date.html but it sets "index,nofollow" on each individual posting, e.g. http://mail.wikimedia.org/pipermail/wikitech-l/2006-April/034969.html
I believe that "noindex,follow" is used for many "sitemap" pages, and this is my idea of how search robots should use RecentChanges.
Indeed, the front page of any newspaper website is also similar to a sitemap. Its content changes so often that it becomes useless to index it under any specific word found there. If people search for "hurricane katrina", they don't want the front page of the Washington Post, which will have changed by the time they arrive. But they might be interested in the news article about this topic, and the front page was the way to harvest the link to that article.
The main difference, then, between the newspaper and Wikipedia is that the newspaper uses its RecentChanges as its front page. Plus the fact that Wikipedia isn't covered by Google News.
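The sitemap-style use of RecentChanges described above could be sketched roughly as follows, in Python. This is only an illustration under stated assumptions: PageScanner, process_page, search_index and frontier are hypothetical names, the "index" is just a dict, and a real robot would also need fetching, politeness delays and robots.txt handling, all of which are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageScanner(HTMLParser):
    """Collects robots meta directives and link targets from one page."""

    def __init__(self):
        super().__init__()
        self.directives = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                t.strip().lower() for t in content.split(",") if t.strip()
            )
        elif tag == "a" and attr.get("href"):
            self.links.append(attr["href"])


def process_page(url, html, search_index, frontier):
    """Handle one already-fetched page (hypothetical helper).

    search_index: dict of url -> page text, standing in for a real index.
    frontier:     list of URLs still to be fetched.
    """
    scanner = PageScanner()
    scanner.feed(html)
    # "noindex": never offer this page itself as a search hit.
    if "noindex" not in scanner.directives:
        search_index[url] = html
    # "follow" (the default): harvest the links for later fetching.
    if "nofollow" not in scanner.directives:
        frontier.extend(urljoin(url, href) for href in scanner.links)


if __name__ == "__main__":
    page = (
        '<html><head><meta name="robots" content="noindex,follow" /></head>'
        '<body><a href="/wiki/Foo">Foo</a></body></html>'
    )
    index, todo = {}, []
    process_page("http://en.wikipedia.org/wiki/Special:Recentchanges",
                 page, index, todo)
    print(index, todo)
```

With "noindex,follow" the page contributes only links to the frontier, which is the newspaper front page case. With "index,nofollow", as on the individual Pipermail postings, the page is stored but its links are ignored.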