Hi,
On 01/16/2016 06:11 PM, Denny Vrandecic wrote:
> To give a bit more thought: I am not terribly worried about current
> crawlers. But currently, and more so in the future, I expect us to provide
> more complex and thus expensive APIs: a SPARQL endpoint, parsing APIs, etc.
> These will simply be expensive to operate. Not for infrequent users - say,
> to the benefit of us 70,000 editors - but for use cases that involve tens
> of millions of requests per day. These have the potential of burning a lot
> of funds to basically support the operations of commercial companies whose
> mission might or might not be aligned with ours.
Why do they need to use our APIs? As I understand it, the Wikidata
SPARQL service was designed so that someone could import a Wikidata
dump, and have their own endpoint to query. I'm sure that someone who
has the need to make millions of requests per day also has the technical
resources to set up their own local mirror. I don't think setting up a
MW mirror would be quite so simple, but it should be doable.
One problem with relying on dumps is that downloading them is often
slow, and there are rate limits[1]. If Google or some other large
entity wanted to donate some hosting space and bandwidth by re-hosting
our dumps, I think that would be a win-win situation all around: they
get their dumps and can rsync directly from us, and it takes
pressure off our infrastructure and lets other people access our
content more easily.
[1] https://phabricator.wikimedia.org/T114019#1892529
-- Legoktm