Hoi, I make use of the SourceMD environment. It is well behaved: it allows for throttling, and when I have multiple jobs it only runs one at a time. I understand that my jobs are put on hold when the situation warrants it; I even put them on hold myself when I think of it.
When someone else puts my jobs on hold, I cannot release them at a better time, and I now have seven jobs doing nothing, while a new job progresses normally. The point is that the management is fine, but given that what I do is well behaved, I expect my jobs to run and, when held, to be released at a later time. When I cannot depend on jobs to finish, my work is not finished, and I do not know whether I should run more jobs, or which jobs, to get the data to a finished state. Thanks, GerardM
On Tue, 18 Jun 2019 at 06:35, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
We are currently dealing with a bot overloading the Wikidata Query Service. This bot does not look actively malicious, but it creates enough load to disrupt the service. As a stopgap measure, we had to deny access to all bots using the default python-requests user agent.
As a reminder, any bot should use a user agent that allows us to identify it [1]. If you have trouble accessing WDQS, please check that you are following those guidelines.
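To make the guideline concrete, here is a minimal sketch of a WDQS request that identifies its bot. The tool name, version, URL, and contact address are placeholders; substitute your own per the Wikimedia User-Agent policy.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder identity -- use your real tool name, version, and contact info.
USER_AGENT = "MyWikidataBot/1.0 (https://example.org/mybot; mybot@example.org)"

query = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 1"
url = "https://query.wikidata.org/sparql?" + urlencode(
    {"query": query, "format": "json"}
)

# Attach the identifying User-Agent instead of the library default.
req = Request(url, headers={"User-Agent": USER_AGENT})
# urlopen(req) would send the request; omitted to keep the sketch offline.
```

The point is simply that the header names a specific tool and a way to reach its operator, so the WDQS team can contact you instead of blocking you.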
To add to this, we have had this trouble because two events that WDQS currently does not deal well with have coincided:
- An edit bot that edited at 200+ edits per minute. This is too much.
Over 60/min is almost always too much. It is also worth considering, if your bot makes multiple changes (e.g. adds multiple statements), doing them in one call instead of several, since WDQS currently runs an update for each change separately, and this can be expensive. We're looking into various improvements to this, but that is the current state.
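As an illustration of the batching advice, the following sketch bundles several statements into a single wbeditentity call, rather than issuing one wbcreateclaim call per statement; WDQS then sees one change to the item instead of one per statement. The item ID, property IDs, values, and token are placeholders.

```python
import json

# Two statements to add in a single edit (IDs and values are illustrative).
statements = [
    {"mainsnak": {"snaktype": "value", "property": "P31",
                  "datavalue": {"type": "wikibase-entityid",
                                "value": {"entity-type": "item", "numeric-id": 5}}},
     "type": "statement", "rank": "normal"},
    {"mainsnak": {"snaktype": "value", "property": "P106",
                  "datavalue": {"type": "wikibase-entityid",
                                "value": {"entity-type": "item", "numeric-id": 82594}}},
     "type": "statement", "rank": "normal"},
]

# One wbeditentity call carrying all statements at once.
params = {
    "action": "wbeditentity",
    "id": "Q42",                     # target item (placeholder)
    "data": json.dumps({"claims": statements}),
    "token": "CSRF_TOKEN_HERE",      # obtained from the API beforehand
    "format": "json",
}
# POSTing `params` to https://www.wikidata.org/w/api.php applies both
# statements in a single edit.
```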
- Several bots have been flooding the query endpoint with requests. There has recently been a growth in bots that a) completely ignore both regular limits and throttling hints, b) do not have a proper identifying user agent, and c) use distributed hosts, so our throttling system has trouble dealing with them automatically. We intend to crack down harder on such clients, because they look a lot like a DDoS and ruin the service experience for everyone.
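For clients that do want to honor the throttling hints, a minimal sketch: on HTTP 429 (or 503), sleep for the number of seconds given in the Retry-After header before retrying, instead of hammering the endpoint. `do_request` here is a hypothetical stand-in for your actual HTTP call, returning a (status, headers, body) tuple.

```python
import time

def fetch_with_backoff(do_request, max_attempts=5):
    """Retry a request, honoring Retry-After on throttling responses."""
    for attempt in range(max_attempts):
        status, headers, body = do_request()
        if status not in (429, 503):
            return body
        # Honor the server's hint; fall back to exponential backoff.
        delay = int(headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("giving up after repeated throttling responses")
```

A client that behaves this way backs off exactly as long as the service asks, which is what the throttling system is designed to expect.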
I will probably write down more detailed rules a bit later, but for now these:
https://www.mediawiki.org/wiki/Wikidata_Query_Service/Implementation#Usage_c... and additionally, having a distinct User-Agent if you're running a bot is a good idea.
And for people who think it's a good idea to launch a max-requests-I-can-stuff-into-the-pipe bot, put it on several Amazon machines so that throttling has a hard time detecting it, and then, once throttling does detect it, neglect to check for a week that all the bot is doing is fetching 403s from the service and wasting everybody's time - please think again. If you want to do something non-trivial querying WDQS and the limits get in the way, please talk to us (and if you know somebody who isn't reading this list but is considering writing a bot interfacing with WDQS, please educate them and refer them to us for help; we would much rather help than ban). Otherwise, we will be forced to put more limitations in place, and those will affect everyone.
-- Stas Malyshev smalyshev@wikimedia.org
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata