Hi,
I hope this is the right mailing list to discuss this issue. Some time ago I ran into a series of temporary bans. I thought I had managed to tackle this by doing a full stop as soon as my bot gets any response status code other than 200.
However, this seems not to have fixed it, since I received the following message:
"requests.exceptions.HTTPError: 403 Client Error: You have been banned until 2019-10-18T10:21:36.495Z, please respect throttling and retry-after headers. for url: https://query.wikidata.org/sparql"
I am looking into this from scratch to see if I can implement a better solution, certainly one that really respects the Retry-After time instead of coming to a full stop.
Whatever I try now, I keep getting 200 responses, and I don't want to start an excessive bot run just to get into a banned state and see the exact header the bot needs to respect.
Is there an example of such a header that I can use to make my own test script?
Or is there example Python code that successfully deals with a Retry-After header?
Regards,
Andra
Hello!
Thanks for trying to not overload the service!
There is some minimal documentation on the throttling done by Wikidata Query Service [1], but it clearly needs to be improved.
High level overview:
Throttling is done per "client", where a client in this case is identified by user-agent and IP address (yes, it is a flawed definition of a client, but it mostly works for throttling purposes). Limits are set on the query execution time and on the number of errors raised by the client. When the limits are reached, an HTTP 429 response is sent to the client with a "Retry-After" HTTP header. This header contains an estimate of how long a client should wait before retrying a request (in seconds). If we see a client that keeps ignoring HTTP 429 for long enough, that client is banned for 24 hours.
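By the way, you don't need to trigger a real ban to see the header for a test script: you can mimic the throttling response locally. A minimal sketch using only the Python standard library (an untested illustration; the port is arbitrary and the real service may send additional headers):

    # Tiny local endpoint that imitates a WDQS throttling response,
    # so client-side Retry-After handling can be tested without a ban.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class ThrottledHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(429)                # Too Many Requests
            self.send_header("Retry-After", "60")  # seconds to wait
            self.end_headers()
            self.wfile.write(b"please retry later\n")

    if __name__ == "__main__":
        HTTPServer(("localhost", 8429), ThrottledHandler).serve_forever()

Point your script at http://localhost:8429/ and check that it actually sleeps before retrying.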
What you can do:
* don't execute more than one request in parallel
* set a user-agent specific to your application (see [2] for some documentation on the user-agent policy)
* when receiving an HTTP 429 response, stop for the duration given by the Retry-After header, or for 1 minute if the header is missing (see the sketch below)
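Putting those three points together, a polite query loop could look roughly like this (a minimal sketch based on the `requests` library; the user-agent string, query parameters, and retry limit are placeholders, not official client code):

    import time
    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"
    # Placeholder user-agent: use one that identifies your own
    # application and includes a way to contact you.
    HEADERS = {"User-Agent": "MyCoolBot/0.1 (mycoolbot@example.org)"}

    def run_query(query, max_attempts=5):
        # Requests are issued strictly one at a time, never in parallel.
        for attempt in range(max_attempts):
            r = requests.get(ENDPOINT,
                             params={"query": query, "format": "json"},
                             headers=HEADERS)
            if r.status_code == 429:
                # WDQS sends Retry-After in seconds; default to 1 minute.
                time.sleep(int(r.headers.get("Retry-After", 60)))
                continue
            r.raise_for_status()
            return r.json()
        raise RuntimeError("still throttled after %d attempts" % max_attempts)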
If you follow all that, you should be good. If you still see throttling / bans, let us know. If you give me the User-Agent of your script and the time at which you received the throttling / ban response, I can have a look into the logs.
Note that we might have some degenerate behaviour when the service is already overloaded (I don't think so, but who knows).
Good luck!
Guillaume
[1] https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limi...
[2] https://meta.wikimedia.org/wiki/User-Agent_policy
Thanks for your prompt response. I wasn't filtering for 429, but only for 503, so that might explain it. This is my current countermeasure against overloading the system:
https://github.com/SuLab/WikidataIntegrator/blob/v0.4.3/wikidataintegrator/w...
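In essence, only a 503 triggered a sleep-and-retry, so a 429 fell through and its Retry-After header was ignored; schematically (a simplified, hypothetical sketch, not the actual code at the link above):

    import time
    import requests

    # Hypothetical reconstruction of the flawed check: only 503 is
    # retried, so an HTTP 429 escalates to an exception instead of
    # honouring the Retry-After header.
    def execute_query(url, query, retry_after=30, max_retries=5):
        for _ in range(max_retries):
            response = requests.get(url, params={"query": query})
            if response.status_code == 503:
                time.sleep(retry_after)
                continue
            response.raise_for_status()  # a 429 raises HTTPError here
            return response
        raise RuntimeError("too many retries")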
If you follow all that, you should be good. If you still see throttling / bans, let us know. If you give me the User-Agent of your script and the time at which you received the throttling / ban response, I can have a look into the logs.
Where do I let you know? Is this email list the right place to do so?
Regards,
Andra
Hello!
Sorry for the late reply.
On Sat, Nov 2, 2019 at 12:31 PM Andra Waagmeester andra@micel.io wrote:
Thanks for your prompt response. I wasn't filtering for 429, but only for 503, so that might explain it. This is my current countermeasure against overloading the system:
https://github.com/SuLab/WikidataIntegrator/blob/v0.4.3/wikidataintegrator/w...
With only a quick look at the code, it looks good enough to me. A few things you might want to improve:
* L1148 [1]: use a default retry_after of 60 seconds instead of 30; that is the upper bound of what our throttling will ask of you
* L1186-L1189 [2]: in case of a 429, you can check the "Retry-After" header to get a sleep value that matches what our throttling expects (see the sketch below)
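For the second point, the handling could look something like this (a hedged sketch, not a patch against the actual file):

    import time

    DEFAULT_RETRY_AFTER = 60  # upper bound of what the throttling will request

    def wait_before_retry(response):
        # On HTTP 429, trust the server's Retry-After header (in seconds);
        # fall back to the 60-second default if it is missing or malformed.
        try:
            delay = int(response.headers.get("Retry-After", DEFAULT_RETRY_AFTER))
        except ValueError:
            delay = DEFAULT_RETRY_AFTER
        time.sleep(delay)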
If you follow all that, you should be good. If you still see throttling / bans, let us know. If you give me the User-Agent of your script and the time at which you received the throttling / ban response, I can have a look into the logs.
Where do I let you know? Is this email list the right place to do so?
This list is the right place. Or you can contact me directly if you want. But others might benefit from this discussion being public.
[1] https://github.com/SuLab/WikidataIntegrator/blob/v0.4.3/wikidataintegrator/w...
[2] https://github.com/SuLab/WikidataIntegrator/blob/v0.4.3/wikidataintegrator/w...
Hi Andra,
Pywikibot should take care of that for you, see https://github.com/wikimedia/pywikibot/blob/master/pywikibot/comms/http.py#L...
Maarten