Dear Toolforge cloud people,
I am running the Scholia web application on Toolforge and am interested in having some of the pages indexed by search engines. We have '<meta name="robots" content="index, nofollow">', which should let search engines index a Scholia page without following the links on it.
We have 3 kinds of content on the webpages generated by Scholia:
1) "Static" content generated from Flask jinja2 templates. This gets indexed (but not that much).
2) Dynamic jQuery-based content based on the Wikidata API service. This does not seem to get indexed by some search engines.
3) Dynamic Wikidata Query Service-based content. This does not get indexed.
I can understand 1) and 3), but not 2). https://query.wikidata.org/robots.txt blocks bot requests for 3), but as far as I can see https://www.wikidata.org/robots.txt does not block Wikidata API requests for 2).
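To double-check my reading of the two robots.txt files, a minimal sketch with Python's standard urllib.robotparser could look like the following. The rule strings are hypothetical stand-ins that I made up for illustration, not the actual contents of either file, and the helper name is_allowed is my own.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules (assumptions, NOT copied from the real files):
# one file that blocks everything, one that blocks only a page prefix.
SAMPLE_BLOCK_ALL = "User-agent: *\nDisallow: /\n"
SAMPLE_BLOCK_SOME = "User-agent: *\nDisallow: /wiki/Special:\n"

if __name__ == "__main__":
    # Case 3): a query service whose robots.txt disallows all paths.
    print(is_allowed(SAMPLE_BLOCK_ALL, "Googlebot",
                     "https://query.wikidata.org/sparql"))
    # Case 2): an API path that the sample rules do not mention.
    print(is_allowed(SAMPLE_BLOCK_SOME, "Googlebot",
                     "https://www.wikidata.org/w/api.php"))
```

To test against the live files instead, the parser's set_url() and read() methods can fetch a robots.txt over the network before calling can_fetch().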
On a webpage on the public web, I have a link to https://scholia.toolforge.org/author/Q20980928, so I would think that this Scholia page would be indexed, and that the h1 tag that is set via the Wikidata API would be indexed as well. As far as I can determine, the page is indexed at Bing and Qwant, but not at DuckDuckGo or Google.
Is there anyone who can explain the discrepancy? As far as I understand, Google does indeed index content generated by JavaScript, including jQuery.
Or should we keep bots out of Scholia and define a restrictive robots.txt to avoid burdening the Toolforge infrastructure too much?
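As a rough illustration of what such a restrictive robots.txt could contain, here is a sketch that builds one and checks it with urllib.robotparser. The policy itself is only an assumption for the sake of the example: allow the /author/ pages, disallow everything else, and ask for a crawl delay of 10 seconds. The function name build_robots_txt is hypothetical, and as far as I know Crawl-delay is not honoured by every crawler.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(allowed_prefixes, crawl_delay=None):
    """Build robots.txt text that allows only the given path prefixes."""
    lines = ["User-agent: *"]
    if crawl_delay is not None:
        # Non-standard directive; some crawlers ignore it.
        lines.append(f"Crawl-delay: {crawl_delay}")
    # Allow rules come first so that first-match parsers see them
    # before the catch-all Disallow rule.
    lines.extend(f"Allow: {prefix}" for prefix in allowed_prefixes)
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical policy: only author pages may be crawled.
    robots_txt = build_robots_txt(["/author/"], crawl_delay=10)
    print(robots_txt)

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    print(parser.can_fetch("*", "https://scholia.toolforge.org/author/Q20980928"))
    print(parser.can_fetch("*", "https://scholia.toolforge.org/some/other/page"))
```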
In Scholia, we currently have a pull request that implements server-side calls to the Wikidata Query Service to generate some metadata for the search engines, so that the metadata can be reached without hitting the WDQS robots.txt restriction. I have been reluctant to merge that pull request because of the extra load on Toolforge and because the extra time each request takes would block the Scholia web application. We have around 100'000 requests per day according to Toolviews; how much of that is bot activity I do not know. Is there anyone who can give us advice here?
Best regards,
Finn Årup Nielsen