On 4/14/06, Lars Aronsson lars@aronsson.se wrote:
I'm not sure if you're talking about the big web search engines (Google, Yahoo, MSN) or the search function in Mediawiki here. There is little excuse for the latter to have any delay. But even for a big web search engine, it is easy to keep track of how often each webpage has changed in history, and economize how often it needs to be revisited. Combined with the high PageRank of en.wp's RecentChanges (9 of 10), it would be trivial for Googlebot to revisit this page (or the front page of websites of major newspapers) every minute or two and make it a high priority to reindex all pages linked from there. I suppose this is how Google News works. Why it still takes about a month for Google to update its index on Wikipedia articles is a mystery to me. Probably it has to do with a lack of competition. If MSN or Yahoo were faster, it would force Google to improve.
I know this was intentionally provocative, but I'll bite anyway.
As far as I know, the general limitation on Google indexing more of Wikipedia is that Wikipedia can't serve pages fast enough (or, more accurately, the extra load from more Googlebot traffic would make Wikipedia slower).
To answer your specific proposal:

1) http://en.wikipedia.org/wiki/Special:Recentchanges has a meta tag:
   <meta name="robots" content="noindex,follow" />
   which indicates it's explicitly disallowed from being crawled.

2) If it were allowed to be crawled, I'd expect it to be regularly updated for the reasons you describe. But even in that case, this particular page changing rapidly is not an indicator that the target pages are also changing rapidly. For example, I imagine that the digg front page changes pretty much every time a crawler visits, but the pages linked *from* digg are not necessarily changing any more rapidly than any other random page on the web.
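
For anyone who wants to check a page's robots meta tag themselves, here is a minimal sketch in Python using only the standard library. The names `RobotsMetaParser` and `robots_directives` are made up for illustration, and this is not a description of how Googlebot actually processes the tag; it only extracts the comma-separated directives. As the directive names suggest, `noindex` concerns the page itself and `follow` concerns its outgoing links.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags.

    Hypothetical helper for illustration; not any crawler's real code.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags like <meta ... /> also arrive here by default.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# The tag quoted above, fed in as a string rather than fetched over the network.
page = '<head><meta name="robots" content="noindex,follow" /></head>'
directives = robots_directives(page)
```

With the tag from Special:Recentchanges, `directives` contains both `noindex` and `follow`.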
Instead, there is a way for webmasters and Google to cooperate: the sitemaps program. You can read more about it here: https://www.google.com/webmasters/sitemaps/docs/en/about.html
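
To give a rough idea of what a site would publish, here is a small Python sketch that builds a sitemap listing each URL with its last-modified date, so a crawler can tell which pages changed without refetching them all. It assumes the sitemaps.org 0.9 XML format, which may differ in detail from what the page linked above describes; the function name `build_sitemap` and the sample URL and date are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org 0.9 protocol (an assumption about which
# version of the format is wanted).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified_date) pairs.

    Dates are passed through as strings, e.g. "2006-04-14".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical sample entry, for illustration only.
sitemap_xml = build_sitemap([
    ("http://en.wikipedia.org/wiki/Example", "2006-04-14"),
])
```

A site with many pages would generate this from its own change log, so that only recently edited pages need to be revisited.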