Thanks for trying to not overload the service!

There is some minimal documentation on the throttling done by Wikidata Query Service [1], but it clearly needs to be improved.

High level overview:

Throttling is done per "client", where a client is identified by user-agent and IP address (yes, it is a flawed definition of client, but it mostly works for throttling purposes). Limits are set on the query execution time and on the number of errors raised by the client. When the limits are reached, an HTTP 429 response is sent to the client, with a "Retry-After" HTTP header. This header contains an estimate, in seconds, of how long the client should wait before retrying a request. If we see a client that seems to ignore HTTP 429 for long enough, that client is going to be banned for 24 hours.
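To make the header handling concrete, here is a minimal sketch of turning a Retry-After value into a number of seconds to wait. The function name retry_after_seconds is made up for illustration. As described above, the service sends the value in seconds; the HTTP-date branch is only an assumption based on the general HTTP definition of the header, not something the service is known to send.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Return how many seconds to wait, or None if the value is unusable.

    `value` is the raw Retry-After header string (or None if absent).
    """
    if value is None:
        return None
    value = value.strip()
    # Usual case: a plain number of seconds, e.g. "30".
    try:
        return max(0, int(value))
    except ValueError:
        pass
    # Assumption: also accept the HTTP-date form that HTTP allows in general.
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0, int((when - now).total_seconds()))
```

A caller that gets None back would fall back to a fixed pause of its own choosing.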

What you can do:

* don't execute more than one request in parallel
* set a user-agent specific to your application (see [2] for some documentation on the user-agent policy)
* when receiving an HTTP 429 response, stop for the duration of the Retry-After header or for 1 minute
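
Since an example was asked for, here is a rough Python sketch of the three points above, using only the standard library. It is not an official client. The User-Agent string, the function names (send_query, wait_seconds, run_query), the max_attempts limit and the choice to raise on any status other than 200 or 429 are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"
# Hypothetical User-Agent: replace with your own tool name and contact details.
USER_AGENT = "ExampleBot/0.1 (https://example.org/examplebot; bot@example.org)"
DEFAULT_WAIT = 60  # seconds; the 1 minute pause mentioned above


def send_query(query):
    """Send one SPARQL query and return (status, headers, body)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url, headers={
        "User-Agent": USER_AGENT,
        "Accept": "application/sparql-results+json",
    })
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.headers, response.read()
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; hand the status back to the caller instead.
        return error.code, error.headers, error.read()


def wait_seconds(headers):
    """Seconds to pause: the Retry-After value, else DEFAULT_WAIT."""
    try:
        return max(0, int(headers.get("Retry-After")))
    except (TypeError, ValueError):
        return DEFAULT_WAIT


def run_query(query, send=send_query, sleep=time.sleep, max_attempts=5):
    """Run one query at a time, pausing whenever the service answers 429."""
    for _ in range(max_attempts):
        status, headers, body = send(query)
        if status == 200:
            return body
        if status == 429:
            sleep(wait_seconds(headers))
            continue
        # Anything else (including a 403 ban) is not retried here.
        raise RuntimeError(f"unexpected HTTP status {status}")
    raise RuntimeError(f"still throttled after {max_attempts} attempts")
```

Calling run_query for each query in a plain loop, never from several threads or processes at once, keeps the script to one request at a time. The send and sleep parameters exist only so the loop can be tried with a fake response, without needing to provoke a real 429.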

If you follow all that, you should be good. If you still see throttling / bans, let us know. If you give me the User-Agent of your script and the time at which you received the throttling / ban response, I can have a look into the logs.

Note that we might have some degenerate behaviour when the service is already overloaded (I don't think so, but who knows).

Good luck!


[1] https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits
[2] https://meta.wikimedia.org/wiki/User-Agent_policy

On Sat, Nov 2, 2019 at 11:37 AM Andra Waagmeester <andra@micel.io> wrote:

    I hope this is the right mailing list to discuss this issue.
Some time ago I ran into a series of temporary bans. I thought I had managed to tackle this by doing a full stop once the bot gets any response code other than 200.

However, this seems not to have fixed it, since I received the following message:

"requests.exceptions.HTTPError: 403 Client Error: You have been banned until 2019-10-18T10:21:36.495Z, please respect throttling and retry-after headers. for url: https://query.wikidata.org/sparql"

I am looking into this from scratch to see if I can implement a better solution, and certainly one that really respects the retry-after time instead of going full stop.

Whatever I try now, I keep getting 200 headers and I don't want to start an excessive bot run to get into a ban state to see the exact header that the bot needs to respect. 

Is there an example of such a header which I can use to make my own test script?

Or is there example Python code that successfully deals with a retry-after header?



Wikidata-tech mailing list

Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation