I noticed that the Googlebot does not use the exact article paths from the sitemap files, but instead accesses the articles through Special:RecentchangesLinked. So instead of www.mysite.com/wiki/My_Article, Google uses www.mysite.com/wiki/Special:RecentchangesLinked/My_Article. Why is that?
I ran into this because I had disallowed /wiki/Special:RecentchangesLinked/ in robots.txt, to prevent the bots from indexing all revisions.
Regards, Rene Bakker
Rene Bakker wrote:
> I noticed that the Googlebot does not use the exact article paths from the sitemap files, but instead accesses the articles through Special:RecentchangesLinked. So instead of www.mysite.com/wiki/My_Article, Google uses www.mysite.com/wiki/Special:RecentchangesLinked/My_Article. Why is that?
It'll follow whatever links it sees. Since every article page carries such a link (the "Related changes" link in the toolbox), the crawler happily follows those as well as the article pages themselves.
> I ran into this because I had disallowed /wiki/Special:RecentchangesLinked/ in robots.txt, to prevent the bots from indexing all revisions.
Disallowing RecentchangesLinked won't affect indexing of revisions.
Note that Recentchanges and Recentchangeslinked both have meta robots tags telling spiders not to index them or follow links from them. This does not affect whether they get _spidered_ -- unless robots.txt blocks those URLs, search spiders will reach those pages through regular links.
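Concretely, the tag in question is a standard robots meta element in the page's head. On Recentchanges and Recentchangeslinked views it looks roughly like this (a generic example, not copied from any particular MediaWiki version):

```
<head>
  <!-- "noindex" = do not add this page to the search index;       -->
  <!-- "nofollow" = do not follow links found on this page.        -->
  <meta name="robots" content="noindex,nofollow" />
</head>
```

Note the division of labour: robots.txt controls whether a URL gets fetched at all, while the meta tag controls what a spider does with a page it has already fetched.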
Old versions also have meta robots tags telling spiders not to index them. Whether they get _spidered_ in the first place is up to your robots.txt configuration, and what other pages on the web link to them, and what their meta robots settings are.
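If you also want robots.txt to keep spiders away from old revisions, the usual pattern is to block the script path rather than individual Special pages. A minimal sketch, assuming the common short-URL layout where readable pages live under /wiki/ and everything with a query string (?oldid=..., ?action=history, diffs, edit forms) goes through /w/index.php -- adjust the paths to match your own installation:

```
User-agent: *
# Block all script-generated views: old revisions (?oldid=...),
# page histories (?action=history), diffs, edit forms, etc.
Disallow: /w/
# Plain /wiki/Page_Name article views remain crawlable.
```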
-- brion
mediawiki-l@lists.wikimedia.org