Search engine indexing of web application - Cloud

23 Nov 2021

Dear Toolforge cloud people,

I am running the Scholia web application on Toolforge and am interested
in having some of its pages indexed by search engines. We have '<meta
name="robots" content="index, nofollow">', which should allow a page to
be indexed without the rest of the Scholia website being crawled.
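
To confirm which directives a rendered page actually sends to crawlers,
something like the following could be used. It is only a sketch using the
Python standard library; the class and function names are made up for
this example and are not part of Scholia.

```python
from html.parser import HTMLParser
from typing import List


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        self.directives.extend(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_directives(html: str) -> List[str]:
    """Return the robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<head><meta name="robots" content="index, nofollow"></head>'
print(robots_directives(page))
```

Here "index" permits indexing of the page itself, while "nofollow" asks
the crawler not to follow the links on it.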

We have 3 kinds of content on the webpages generated by Scholia:

1) "Static" content generated from Flask Jinja2 templates. This gets
indexed (but not that much).

2) Dynamic jQuery-based content based on the Wikidata API service. This 
does not seem to get indexed by some search engines.

3) Dynamic Wikidata Query Service-based content. This does not get
indexed.

I can understand 1) and 3), but not 2).
https://query.wikidata.org/robots.txt blocks bot requests for 3),
but as far as I can see https://www.wikidata.org/robots.txt does not
block Wikidata API requests for 2).
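
This kind of check can be scripted with Python's urllib.robotparser. The
sketch below parses a robots.txt given as text and asks whether a URL may
be fetched. The sample rules and example.org URLs are hypothetical and
are NOT the contents of the real Wikimedia files; to test the real ones,
their text would have to be downloaded and passed in instead.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
SAMPLE_RULES = """\
User-agent: *
Disallow: /sparql
"""

for url in (
    "https://example.org/sparql?query=x",
    "https://example.org/w/api.php?action=query",
):
    print(url, is_allowed(SAMPLE_RULES, "ExampleBot", url))
```

Note that this only models the robots.txt rules; whether a given search
engine renders JavaScript at all is a separate question.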

On a webpage on the public web, I have a link to
https://scholia.toolforge.org/author/Q20980928, so I would expect that
Scholia page to be indexed, including the h1 tag that is set via the
Wikidata API. As far as I can determine, the page is indexed by Bing and
Qwant, but not by DuckDuckGo or Google.

I am wondering whether anyone can explain the discrepancy? As far as I
understand, Google does index content generated by JavaScript, including
jQuery.

Should we keep bots out of Scholia and define a restrictive robots.txt
to avoid burdening the Toolforge infrastructure too much?
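
For concreteness, a restrictive robots.txt could be generated along the
following lines. This is only a sketch: the function name is made up,
and allowing just the /author/ prefix is an illustrative choice, not a
decision about which Scholia pages should be crawlable.

```python
from typing import Iterable, Optional


def build_robots_txt(allowed_prefixes: Iterable[str],
                     crawl_delay: Optional[int] = None) -> str:
    """Build a robots.txt that blocks everything except given prefixes."""
    lines = ["User-agent: *"]
    if crawl_delay is not None:
        # Crawl-delay is a non-standard extension; some crawlers ignore it.
        lines.append(f"Crawl-delay: {crawl_delay}")
    # Allow lines come first so that first-match parsers honour them.
    lines.extend(f"Allow: {prefix}" for prefix in allowed_prefixes)
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


print(build_robots_txt(["/author/"], crawl_delay=10))
```

The resulting text would have to be served at /robots.txt of the tool
for crawlers to pick it up.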

In Scholia, we currently have a pull request that implements server-side
calls to the Wikidata Query Service to generate some metadata for the
search engines, which they can then reach without hitting the WDQS
robots.txt restriction. I have been reluctant to merge that pull request
because of the extra load on Toolforge and the extra time each such
request would block the Scholia web application. We have around 100,000
requests per day according to Toolviews; how much of that is bot
activity I do not know. I am wondering whether anyone can give us advice
here?
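
One possible way to limit the extra load, should the server-side calls
be merged, would be to cache the generated metadata for a while so that
repeated hits on the same page do not trigger repeated WDQS queries. The
sketch below is an assumption about how that could look and is not what
the pull request does; TTLCache and fake_fetch are hypothetical names,
and fake_fetch stands in for the real query.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Remember results of a slow lookup for ttl_seconds."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh, no new query
        value = self._fetch(key)
        self._store[key] = (now, value)
        return value


calls = []


def fake_fetch(qid: str) -> dict:
    """Stand-in for a server-side WDQS query (hypothetical)."""
    calls.append(qid)
    return {"qid": qid}


cache = TTLCache(fake_fetch, ttl_seconds=3600)
cache.get("Q20980928")
cache.get("Q20980928")  # served from the cache
print(len(calls))
```

This in-memory version grows without bound and is per-process, so it
would need an eviction policy or a shared store in practice.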

best regards
Finn Årup Nielsen