On 4/14/06, Lars Aronsson <lars(a)aronsson.se> wrote:
I'm not sure if you're talking about the big web search engines
(Google, Yahoo, MSN) or the search function in Mediawiki here.
There is little excuse for the latter to have any delay. But even
for a big web search engine, it is easy to keep track of how often
each webpage has changed in history, and economize how often it
needs to be revisited. Combined with the high PageRank of en.wp's
RecentChanges (9 of 10), it would be trivial for Googlebot to
revisit this page (or the front page of websites of major
newspapers) every minute or two and make it a high priority to
reindex all pages linked from there. I suppose this is how Google
News works. Why it still takes about a month for Google to update
its index on Wikipedia articles is a mystery to me. Probably it
has to do with a lack of competition. If MSN or Yahoo were
faster, it would force Google to improve.
I know this was intentionally provocative, but I'll bite anyway.
As far as I know, the general limitation on Google indexing more of
Wikipedia is that Wikipedia can't serve pages fast enough (or, more
accurately, that the extra load from more Googlebot traffic would make
Wikipedia slower).
To answer your specific proposal:
1) http://en.wikipedia.org/wiki/Special:Recentchanges has a meta tag:
<meta name="robots" content="noindex,follow" />
which explicitly asks robots not to index the page itself (the "follow"
part still lets them follow its links).
2) If it were indexed, I'd expect it to be revisited regularly for the
reasons you describe. But even in that case, this
particular page changing rapidly is not an indicator that the target
pages are also changing rapidly. For example, I imagine that the Digg
front page changes pretty much every time a crawler visits, but the
pages linked *from* digg are not necessarily changing any more rapidly
than any other random page on the web is changing.
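To make the meta tag in point 1) concrete, here is a minimal sketch of how a crawler might read it. This is only an illustration using Python's standard html.parser; the class and function names are made up for this example and say nothing about how Googlebot is actually implemented.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here.
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                part = part.strip().lower()
                if part:
                    self.directives.add(part)


def robots_directives(html):
    """Return the set of robots directives found in an HTML fragment."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# The tag quoted above from Special:Recentchanges.
tag = '<meta name="robots" content="noindex,follow" />'
directives = robots_directives(tag)

may_index = "noindex" not in directives          # False: keep the page out of the index
may_follow_links = "nofollow" not in directives  # True: its links may still be followed
```

Under these assumptions the page itself stays out of the index, while the links on it remain available to the crawler.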
Instead, there is a way for webmasters and Google to cooperate: the
sitemaps program. You can read more about it here:
https://www.google.com/webmasters/sitemaps/docs/en/about.html
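
As a rough sketch of what a site could publish for this, the snippet below builds a small sitemap listing pages with their last-modified dates. It assumes the sitemaps.org 0.9 XML format (urlset, url, loc, lastmod), which may differ in details from the schema the program used at the time of writing; the function name and the example page are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemaps.org 0.9 namespace.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from an iterable of (url, last_modified) pairs.

    last_modified is expected to be a W3C date string such as "2006-04-14".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical input: in practice this would come from the wiki's own
# record of recent changes rather than a hard-coded list.
recently_changed = [
    ("http://en.wikipedia.org/wiki/Example", "2006-04-14"),
]
sitemap_xml = build_sitemap(recently_changed)
```

The point of the lastmod field is that the site tells the crawler which pages actually changed, so the crawler does not have to guess from how often a hub page like RecentChanges changes.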