Dear Toolforge cloud people,
I am running the Scholia web application on Toolforge and am interested
in having some of its pages indexed by search engines. We have '<meta
name="robots" content="index, nofollow">', which should let search
engines index a page without following the links on it.
We have 3 kinds of content on the webpages generated by Scholia:
1) "Static" content generated from Flask Jinja2 templates. This gets
indexed (but not that much).
2) Dynamic jQuery-based content based on the Wikidata API service. This
does not seem to get indexed by some search engines.
3) Dynamic Wikidata Query Service-based content. This does not get
indexed.
I can understand 1) and 3), but not 2).
https://query.wikidata.org/robots.txt blocks bot requests for 3),
but as far as I can see https://www.wikidata.org/robots.txt does not
block Wikidata API requests for 2).
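One way to check this reading of the two robots.txt files is with
Python's standard urllib.robotparser. The sketch below is illustrative
only: the rules in EXAMPLE_ROBOTS_TXT are hypothetical and are not the
actual contents of either file, and is_allowed is a helper name I made
up. The real files would have to be fetched and passed in instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real contents of
# https://query.wikidata.org/robots.txt or https://www.wikidata.org/robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /sparql
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A query-service style URL matches the Disallow prefix above.
wdqs_ok = is_allowed(
    EXAMPLE_ROBOTS_TXT, "Googlebot",
    "https://query.wikidata.org/sparql?query=SELECT",
)
# An API-style path does not match any Disallow rule in the example.
api_ok = is_allowed(
    EXAMPLE_ROBOTS_TXT, "Googlebot",
    "https://www.wikidata.org/w/api.php?action=wbgetentities",
)
print(wdqs_ok, api_ok)
```

Note that a crawler only consults the robots.txt of the host it is
requesting, so each file has to be checked against URLs on its own host.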
On a webpage on the public web, I have a link to
https://scholia.toolforge.org/author/Q20980928, so I would expect that
Scholia page to be indexed, including the h1 tag that is set via the
Wikidata API. As far as I can determine, the page is indexed by Bing
and Qwant, but not by DuckDuckGo or Google.
Can anyone explain the discrepancy? As far as I understand, Google does
index JavaScript-generated content, including content rendered with
jQuery.
Or should we keep bots out of Scholia altogether and define a
restrictive robots.txt to avoid burdening the Toolforge infrastructure
too much?
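To make the restrictive option concrete, here is a minimal sketch of
how such a robots.txt could be generated and checked. The function name
build_robots_txt, the path prefixes, and the delay value are all
assumptions for illustration, not a decided policy. Crawl-delay is a
non-standard directive that not every crawler honors (Google, for one,
ignores it).

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_prefixes, crawl_delay=None):
    """Build a robots.txt body that applies to all user agents."""
    lines = ["User-agent: *"]
    if crawl_delay is not None:
        lines.append(f"Crawl-delay: {crawl_delay}")
    lines.extend(f"Disallow: {prefix}" for prefix in disallowed_prefixes)
    return "\n".join(lines) + "\n"


# Hypothetical policy: keep author pages crawlable, block two other prefixes.
robots_txt = build_robots_txt(["/topic/", "/venue/"], crawl_delay=10)

# Verify the generated text behaves as intended before deploying it.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
author_ok = parser.can_fetch(
    "*", "https://scholia.toolforge.org/author/Q20980928")
topic_ok = parser.can_fetch(
    "*", "https://scholia.toolforge.org/topic/Q1")
print(robots_txt)
print(author_ok, topic_ok, parser.crawl_delay("*"))
```

The resulting text could be returned from a Flask route at /robots.txt
with a text/plain content type.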
In Scholia, we currently have a pull request that implements
server-side calls to the Wikidata Query Service to generate some
metadata for search engines, which they can then reach without hitting
the WDQS robots.txt restriction. I have been reluctant to merge that
pull request because of the extra load on Toolforge and because the
extra time each request takes would block the Scholia web application.
We have around 100,000 requests per day according to Toolviews; how
much of that is bot activity I do not know. Can anyone give us advice
here?
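One possible mitigation for the load concern, which is not part of the
pull request, would be to cache the server-side results so that
repeated bot hits on the same page do not each trigger a WDQS query.
The sketch below is only an assumption about how that might look:
TTLCache and fake_fetch are hypothetical names, and fake_fetch stands
in for the real query, which would also need a short timeout.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}  # key -> (time stored, value)

    def get_or_fetch(self, key, fetch):
        """Return a fresh cached value, or call fetch(key) and cache it."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = fetch(key)
        self._store[key] = (now, value)
        return value


calls = []


def fake_fetch(qid):
    # Hypothetical stand-in for a server-side WDQS request.
    calls.append(qid)
    return {"qid": qid, "title": f"metadata for {qid}"}


cache = TTLCache(ttl=3600)
first = cache.get_or_fetch("Q20980928", fake_fetch)
second = cache.get_or_fetch("Q20980928", fake_fetch)  # served from cache
print(len(calls))
```

An in-memory cache like this is per process and is lost on restart, so
it only helps if the same pages are requested repeatedly within the
expiry window.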
Best regards,
Finn Årup Nielsen