We have a small proprietary wiki, and we would like to be able to search the entire wiki content daily with SharePoint.
It looks like the easiest way to do that would be to start a crawl at the Special:Allpages page. However, SharePoint immediately stops any such crawl because the site has:
<meta name="robots" content="noindex,nofollow" />
We looked but can't seem to find any configuration option that we set to include those tags. There is no robots.txt file in the root directory, and we haven't set anything in LocalSettings.php or DefaultSettings.php to prevent robots from following the pages (e.g. DefaultSettings.php has $wgNamespaceRobotPolicies = array(); and LocalSettings.php has no robot directives at all).
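For reference, robot directives in LocalSettings.php would normally look something like the lines below; these are only an illustration of the syntax, not something taken from our configuration:

# Illustration only -- nothing like this exists in our LocalSettings.php
$wgDefaultRobotPolicy     = 'index,follow';                           # site-wide default
$wgNamespaceRobotPolicies = array( NS_TALK => 'noindex,nofollow' );   # per-namespace override
$wgArticleRobotPolicies   = array( 'Main Page' => 'noindex,follow' ); # per-page override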
1) Is this a default setting for the special pages? 2) If it isn't, where can we look for things we might have set that we can turn off? 3) If it is, is there anything we can turn on to stop that tag from being put in the page?
If we can't prevent those tags from being inserted, has anyone managed to use the Special:Export feature with SharePoint? Any articles that might help us solve this problem?
In theory I could write a .NET application to read the anchor tags out of the page, then create an .aspx page without the noindex,nofollow settings for SharePoint to crawl the pages listed on Special:Allpages. But surely there's an easier way.
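For what it's worth, the scraping step would amount to something like the rough sketch below (shown in PHP only to illustrate the idea, completely untested; wiki.example.com is a placeholder for our server):

<?php
# Rough sketch: fetch Special:Allpages, pull out the links, and emit a plain
# page of links with no robots meta tag for the crawler to start from.
$wikiBase = 'http://wiki.example.com';
$html = file_get_contents( $wikiBase . '/index.php?title=Special:Allpages' );

# Collect the href attributes of all anchor tags on the page.
preg_match_all( '/<a\s[^>]*href="([^"]+)"/i', $html, $matches );

echo "<html><head><title>Wiki crawl seed</title></head><body>\n";
foreach ( array_unique( $matches[1] ) as $href ) {
	# Skip links back into the Special: namespace; we only want articles.
	if ( strpos( $href, 'Special:' ) !== false ) {
		continue;
	}
	$url = htmlspecialchars( $wikiBase . $href );
	echo "<a href=\"$url\">$url</a><br />\n";
}
echo "</body></html>\n";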
Thanks,
Chris
Christopher Desmarais (Contractor) wrote:
- Is this a default setting for the special pages?
It's hard-coded for all special pages.
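That is, the shared special-page setup applies the policy unconditionally before the individual special page runs, roughly like this (a paraphrase for illustration, not the exact source):

# What effectively happens for every special page:
$wgOut->setRobotPolicy( 'noindex,nofollow' );

which is why nothing you put in LocalSettings.php takes effect for special pages.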
- If it isn't, where can we look for things we might have set that we can turn off?
- If it is, is there anything we can turn on to stop that tag from being put in the page?
Index: includes/specials/Allpages.php
===================================================================
--- includes/specials/Allpages.php	(revision 36353)
+++ includes/specials/Allpages.php	(working copy)
@@ -12,6 +12,8 @@
 function wfSpecialAllpages( $par=NULL, $specialPage ) {
 	global $wgRequest, $wgOut, $wgContLang;
 
+	$wgOut->setRobotPolicy( '' );
+
 	# GET values
 	$from = $wgRequest->getVal( 'from' );
 	$namespace = $wgRequest->getInt( 'namespace' );
This is https://bugzilla.wikimedia.org/show_bug.cgi?id=8473
The obvious solution is to make $wgArticleRobotPolicies work as advertised and not be overridden by hardwired code.
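That is, something along these lines in LocalSettings.php ought to be all that is needed (illustrative syntax; with the current hardwired policy it has no effect on special pages):

$wgArticleRobotPolicies = array(
	'Special:Allpages' => 'index,follow',
);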
Also, having users maintain private copies of includes/* usually lasts only until the next upgrade, when different staff don't know about the previous tweaks.
Yes, the user could also choose to maintain a sitemap, but that is beside the point.