It might be useful to include a robots.txt file that'll tell search spiders not to bother with any of the active pages such as 'Edit'. While it isn't hard to make this kind of file, it could be useful to include one for the sake of giving people a starting place, since most wikis would have the same robots.txt file anyway. Most of my sites get hit by spiders several times a day, so keeping them from wasting time on pages they don't need to index cuts down on wasted server time.
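For example (just a sketch, and the paths are assumptions rather than anything a default install guarantees): if plain article views are served under a prefix like /wiki/ while edits, history and special pages still go through the script at /index.php, the whole file could be as small as:

    # robots.txt at the web root: keep spiders off the script URLs,
    # leave the plain /wiki/ article views crawlable
    User-agent: *
    Disallow: /index.php

The point is only that view URLs and 'active' URLs have to live under different prefixes for a rule like that to work.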
On Oct 26, 2004, at 4:04 PM, Michael wrote:
> It might be useful to include a robots.txt file that'll tell search spiders not to bother with any of the active pages such as 'Edit'.
This will be dependent on your server configuration. Note that robots.txt works on URL prefixes, so you need a reliable way of distinguishing plain view hits from other URLs.
(The meta tags already tell search engines not to index edit pages and other special pages, and not to continue spidering from them, but won't prevent the initial hit to load that page.)
-- brion vibber (brion @ pobox.com)
> This will be dependent on your server configuration. Note that robots.txt works on URL prefixes, so you need a reliable way of distinguishing plain view hits from other URLs.
I'd just make the example work if the wiki is in the root folder for the site. That'd be enough to give most people a starting place if nothing else. And I think you can distinguish the ones that need to be ignored by the '?' in the URL. Are there any pages that should be ignored that don't have the '?'?
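(One caveat on the '?' idea: the original robots.txt protocol only matches plain URL prefixes, so a rule keyed on the query string relies on wildcard patterns, a nonstandard extension that only some crawlers honor. For those that do, a sketch would be:

    # Nonstandard wildcard syntax; assumes plain views carry no query string
    User-agent: *
    Disallow: /*?

Crawlers that don't understand the wildcard treat that as a literal prefix no URL starts with, so they simply skip the rule.)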
> (The meta tags already tell search engines not to index edit pages and other special pages, and not to continue spidering from them, but won't prevent the initial hit to load that page.)
Many search engines ignore those meta tags.
On Oct 26, 2004, at 4:37 PM, Michael wrote:
>> This will be dependent on your server configuration. Note that robots.txt works on URL prefixes, so you need a reliable way of distinguishing plain view hits from other URLs.
> I'd just make the example work if the wiki is in the root folder for the site. That'd be enough to give most people a starting place if nothing else. And I think you can distinguish the ones that need to be ignored by the '?' in the URL. Are there any pages that should be ignored that don't have the '?'?
Every single page will have a ? if you're running without PATH_INFO support or rewrite rules.
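A rough sketch of the rewrite-rule route, with every path here an assumption rather than a default: give plain views their own prefix, then let robots.txt block only the script prefix.

    # .htaccess at the web root (mod_rewrite enabled):
    # serve article views as /wiki/Page_title
    RewriteEngine On
    RewriteRule ^wiki/(.*)$ /index.php?title=$1 [L,QSA]

    # LocalSettings.php: have MediaWiki generate /wiki/ links for plain views
    $wgArticlePath = "/wiki/$1";

    # robots.txt: edit/history/special URLs still go through index.php
    User-agent: *
    Disallow: /index.php

With something like that in place, every 'active' URL still starts with /index.php while views start with /wiki/, so a plain prefix rule is enough.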
>> (The meta tags already tell search engines not to index edit pages and other special pages, and not to continue spidering from them, but won't prevent the initial hit to load that page.)
> Many search engines ignore those meta tags.
Such as?
-- brion vibber (brion @ pobox.com)