Re: [Wikidata] Wikidata Query Service User-Agent requirements for script users

24 Jul 2019

Hi Stas,

One thing that I've been wondering about is whether we could take a
little bit of load off via caching.

At the moment, if you run the same query again within a minute or two,
it uses the cached results. But after a few minutes, anyone who
follows the link triggers a new run.

If a query is embedded somewhere, or it does the rounds on Twitter or
in the newsletter, it might get a long stream of visitors spread out
enough to miss the cache window, meaning we need to recalculate it a
lot.

For a lot of queries, of course, this is a good thing - we want people
to have the newest data, especially for maintenance queries. But for a
lot of others, either the data isn't going to change in the next day
(eg maps of cities) or it's so high level that being a little old
won't affect much (eg high-level counts of groups of items where all
the results are in the tens of thousands anyway).

So the suggestion: would it be possible to have some kind of
comment/command (similar to #defaultView:Map) that keeps the results
cached for a day or two? This would make it an opt-in approach, and if
this is done as a comment then the user could remove it or tweak the
query to force an update. It certainly wouldn't solve the underlying
load issues - bots aren't likely to want longer cache times - but it
might help take a little bit of the load off.
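
To make the idea concrete, here is a rough sketch of how an opt-in
directive could be read and honoured. Everything in it is an assumption
for illustration: the directive name "#cacheFor:", its m/h/d units, the
120-second default window and the ResultCache class are all hypothetical
and not part of WDQS. The cache is keyed on the exact query text, so
removing the comment or tweaking the query forces a fresh run.

```python
import re
import time

# Hypothetical directive, e.g. "#cacheFor:1d". WDQS does not define this;
# it only illustrates the opt-in comment suggested above.
_DIRECTIVE = re.compile(r"^\s*#cacheFor:(\d+)([mhd])\s*$", re.MULTILINE)
_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}
DEFAULT_TTL = 120  # assumed stand-in for the current "minute or two" window


def cache_ttl(query):
    """Return the cache lifetime in seconds requested by the query text."""
    match = _DIRECTIVE.search(query)
    if match is None:
        return DEFAULT_TTL
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


class ResultCache:
    """Tiny in-memory cache keyed on the exact query text."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}  # query text -> (expiry timestamp, result)

    def get_or_run(self, query, run):
        now = self._clock()
        hit = self._entries.get(query)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: reuse the stored result
        result = run(query)  # expired or unseen: recalculate
        self._entries[query] = (now + cache_ttl(query), result)
        return result
```

With this sketch, a query starting with "#cacheFor:1d" would be
recalculated at most once a day however many people follow the link,
while queries without the directive keep the short default window.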

It might also improve the user experience in some circumstances - if I
email someone a query which I can force to be cached, then I know when
they open it, they'll get something promptly rather than taking a long
time, and (if it's a complex query) I can know for sure it'll run
rather than timing out.

Andrew.

On Tue, 23 Jul 2019 at 08:43, Stas Malyshev <smalyshev(a)wikimedia.org> wrote:
...

 Hello all!

 Here is (at last!) an update on what we are doing to protect the
 stability of Wikidata Query Service.

 For 4 years we have offered Wikidata users the Query Service, a
 powerful tool that allows anyone to query the content of Wikidata,
 without any identification needed. This means that anyone can use the
 service from a script and make heavy or very frequent requests.
 However, this freedom has led to the service being overloaded by too
 many queries, causing the issues and lag that you may have noticed.

 A reminder about the context:

 We have had a number of incidents where the public WDQS endpoint was
 overloaded by bot traffic. We don't think that any of that activity was
 intentionally malicious, but rather that the bot authors most probably
 don't understand the cost of their queries and the impact they have on
 our infrastructure. We've recently seen more distributed bots, coming
 from multiple IPs from cloud providers. This kind of pattern makes it
 harder and harder to filter or throttle an individual bot. The impact
 has ranged from increased update lag to full service interruption.

 What we have been doing:

 While we would love to allow anyone to run any query they want at any
 time, we're not able to sustain that load, and we need to be more
 aggressive in how we throttle clients. We want to be fair to our users
 and allow everyone to use the service productively. We also want the
 service to be available to the casual user and provide up-to-date access
 to the live Wikidata data. And while we would love to throttle only
 abusive bots, to be able to do that we need to be able to identify them.

 We have two main means of identifying bots:

 1) their user agent and IP address
 2) the pattern of their queries

 Identifying patterns in queries is done manually, by a person inspecting
 the logs. It takes time and can only be done after the fact. We can only
 start our identification process once the service is already overloaded.
 This is not going to scale.

 IP addresses are starting to be problematic. We see bots running on
 cloud providers and running their workloads on multiple instances, with
 multiple IP addresses.

 We are left with user agents. But here, we have a problem again. To
 block only abusive bots, we would need those bots to use a clearly
 identifiable user agent, so that we can throttle or block them and
 contact the author to work together on a solution. It is unlikely that
 an intentionally abusive bot will voluntarily provide a way to be
 blocked. So we need to be more aggressive about bots which are using a
 generic user agent. We are not blocking those, but we are limiting the
 number of requests coming from generic user agents. This is a large
 bucket, with a lot of bots that are in this same category of "generic
 user agent". Sadly, this is also the bucket that contains many small
 bots that generate only a very reasonable load. And so we are also
 impacting the bots that play fair.

 At the moment, if your bot is affected by our restrictions, configure a
 custom user agent that identifies you; this should be sufficient to give
 you enough bandwidth. If you are still running into issues, please
 contact us; we'll find a solution together.
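
 As an illustration of the advice above, a script might set an
 identifying user agent along these lines. This is a minimal sketch,
 not an official client: the function name, the "name/version (contact)"
 layout and the example values are assumptions, and only the public
 endpoint URL and the standard SPARQL JSON media type are taken as given.

```python
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_wdqs_request(sparql, tool_name, version, contact):
    """Build (but do not send) a GET request with an identifying User-Agent.

    The "name/version (contact)" layout is an assumed convention here; the
    point is only that the string names the tool and a way to reach its author.
    """
    user_agent = f"{tool_name}/{version} ({contact})"
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": sparql})
    headers = {
        "User-Agent": user_agent,
        "Accept": "application/sparql-results+json",
    }
    return urllib.request.Request(url, headers=headers)


# Hypothetical bot name and contact address, for illustration only.
request = build_wdqs_request(
    "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 5",
    tool_name="ExampleBot",
    version="0.1",
    contact="mailto:someone@example.org",
)
```

 Passing the resulting object to urllib.request.urlopen() would send the
 query; the sketch stops short of that so it can be read and adapted
 without touching the live service.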

 What's coming next:

 First, it is unlikely that we will be able to remove the current
 restrictions in the short term. We're sorry for that, but the
 alternative - service being unresponsive or severely lagged for everyone
 - is worse.

 We are exploring a number of alternatives: adding authentication to the
 service and allowing higher quotas to bots that authenticate, and creating
 an asynchronous queue, which could allow running more expensive queries,
 but with longer deadlines. We are also in the process of hiring another
 engineer to work on these ideas.

 Thanks for your patience!

 WDQS Team

 _______________________________________________
 Wikidata mailing list
 Wikidata(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata 

--
- Andrew Gray
  andrew(a)generalist.org.uk
