Hi,
On 01/16/2016 06:11 PM, Denny Vrandecic wrote:
> To give a bit more thought: I am not terribly worried about current crawlers. But currently, and more so in the future, I expect us to provide more complex and thus expensive APIs: a SPARQL endpoint, parsing APIs, etc. These will simply be expensive to operate. Not for infrequent users - say, for the benefit of our 70,000 editors - but for use cases that involve tens of millions of requests per day. These have the potential of burning a lot of funds to basically support the operations of commercial companies whose mission might or might not be aligned with ours.
Why do they need to use our APIs? As I understand it, the Wikidata SPARQL service was designed so that someone could import a Wikidata dump and run their own endpoint to query. I'm sure that someone who needs to make millions of requests per day also has the technical resources to set up their own local mirror. I don't think setting up an MW mirror would be quite so simple, but it should be doable.
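To make that concrete, here is a minimal sketch (Python with the requests library) of what "query your own endpoint" looks like once a mirror is loaded from a dump. The public endpoint URL is the real query.wikidata.org service; the localhost URL is only an assumption about a default local Blazegraph/query-service install, so adjust it to wherever your own endpoint listens.

```python
# Minimal sketch: run the same SPARQL query against the public Wikidata
# endpoint or against a self-hosted mirror loaded from a dump.
import requests

PUBLIC_ENDPOINT = "https://query.wikidata.org/sparql"
# Assumed default URL of a local query-service/Blazegraph install; change as needed.
LOCAL_ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"

QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item WHERE { ?item wdt:P31 wd:Q146 . } LIMIT 10
"""

def run_query(endpoint, query):
    """POST a SPARQL query and return the JSON result bindings."""
    resp = requests.post(
        endpoint,
        data={"query": query},
        headers={"Accept": "application/sparql-results+json"},
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

# A heavy user would point this at LOCAL_ENDPOINT, keeping millions of
# requests per day off the shared infrastructure.
for row in run_query(PUBLIC_ENDPOINT, QUERY):
    print(row["item"]["value"])
```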
One problem with relying on dumps is that downloading them is often slow, and there are rate limits[1]. If Google or some other large entity wanted to donate hosting space and bandwidth by re-hosting our dumps, I think that would be a win-win all around - they get the dumps and can rsync directly from us, which also takes pressure off of our infrastructure and lets other people access our content more easily.
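For the "pull the dump once instead of hammering the API" workflow, a sketch like the following is roughly what a consumer (or a re-hosting mirror) would run. The exact path and filename are assumptions about how dumps.wikimedia.org lays out the Wikidata entity dumps; substitute a mirror's URL if one exists.

```python
# Minimal sketch: stream a full Wikidata entity dump to disk once,
# rather than issuing millions of individual API requests.
import shutil
import requests

# Assumed dump location; replace with a mirror URL if available.
DUMP_URL = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz"
LOCAL_FILE = "wikidata-latest-all.json.gz"

with requests.get(DUMP_URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open(LOCAL_FILE, "wb") as out:
        # Stream to disk so the multi-gigabyte file never sits in memory.
        shutil.copyfileobj(resp.raw, out)

print("Saved dump to", LOCAL_FILE)
```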
[1] https://phabricator.wikimedia.org/T114019#1892529
-- Legoktm