Adding authentication to the service, and allowing higher quotas to bots that authenticate.
 
Awesome and expected.

Creating an asynchronous queue, which could allow running more expensive queries, but with longer deadlines.

Even more awesome!
Will this make the following feasible: my 2-hour query finally returning its results into my 1 GB csv.zip file?



On Tue, Jul 23, 2019 at 5:47 AM Amir Sarabadani <amir.sarabadani@wikimedia.de> wrote:
Hey,
Forgive my ignorance. I don't know much about the infrastructure of WDQS and how it works. I just want to mention how application servers do it. In appservers, there are dedicated nodes both for Apache and the replica database. So if a bot overdoes things on Wikipedia (which happens quite a lot), users won't notice anything, but the other bots take the hit. Routing based on UA seems hard, though, whereas it's easy in MediaWiki (if you hit api.php, we assume it's a bot).

Have you considered this as a longer-term solution?
Best

On Tue, 23 Jul 2019 at 09:43, Stas Malyshev <smalyshev@wikimedia.org> wrote:
Hello all!

Here is (at last!) an update on what we are doing to protect the
stability of Wikidata Query Service.

For four years we have been offering Wikidata users the Query Service, a
powerful tool that allows anyone to query the content of Wikidata,
without any identification needed. This means that anyone can use the
service using a script and make heavy or very frequent requests.
However, this freedom has led to the service being overloaded by too
many queries, causing the issues and lag that you may have noticed.

A reminder about the context:

We have had a number of incidents where the public WDQS endpoint was
overloaded by bot traffic. We don't think that any of that activity was
intentionally malicious, but rather that the bot authors most probably
don't understand the cost of their queries and the impact they have on
our infrastructure. We've recently seen more distributed bots, coming
from multiple IPs from cloud providers. This kind of pattern makes it
harder and harder to filter or throttle an individual bot. The impact
has ranged from increased update lag to full service interruption.

What we have been doing:

While we would love to allow anyone to run any query they want at any
time, we're not able to sustain that load, and we need to be more
aggressive in how we throttle clients. We want to be fair to our users
and allow everyone to use the service productively. We also want the
service to be available to the casual user and provide up-to-date access
to the live Wikidata data. And while we would love to throttle only
abusive bots, to be able to do that we need to be able to identify them.

We have two main means of identifying bots:

1) their user agent and IP address
2) the pattern of their queries

Identifying patterns in queries is done manually, by a person inspecting
the logs. It takes time and can only be done after the fact. We can only
start our identification process once the service is already overloaded.
This is not going to scale.

IP addresses are starting to be problematic. We see bots running on
cloud providers and running their workloads on multiple instances, with
multiple IP addresses.

We are left with user agents. But here, we have a problem again. To
block only abusive bots, we would need those bots to use a clearly
identifiable user agent, so that we can throttle or block them and
contact the author to work together on a solution. It is unlikely that
an intentionally abusive bot will voluntarily provide a way to be
blocked. So we need to be more aggressive about bots which are using a
generic user agent. We are not blocking those, but we are limiting the
number of requests coming from generic user agents. This is a large
bucket: many bots fall into the "generic user agent" category. Sadly,
it also contains many small bots that generate only a very reasonable
load, so we are also impacting the bots that play fair.
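
To make the shared-bucket idea concrete, here is a minimal sketch of
per-user-agent throttling with a token bucket. It is not the actual WDQS
implementation; the list of generic prefixes, the capacity, and the
refill rate are hypothetical values chosen only for illustration.

```python
import time

# Hypothetical prefixes treated as "generic"; the real list is not given here.
GENERIC_PREFIXES = ("python-requests", "java", "curl")


def bucket_key(user_agent):
    """Map a user agent to a bucket; all generic agents share one bucket."""
    ua = (user_agent or "").strip().lower()
    if not ua or ua.startswith(GENERIC_PREFIXES):
        return "generic"
    return ua


class TokenBucket:
    def __init__(self, capacity, refill_per_sec, now):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Throttle:
    def __init__(self, capacity=2, refill_per_sec=1.0, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.buckets = {}

    def allow(self, user_agent):
        now = self.clock()
        key = bucket_key(user_agent)
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_sec, now)
            self.buckets[key] = bucket
        return bucket.allow(now)
```

In this sketch, two unrelated bots that both send a generic user agent
draw from the same tokens, which is why a well-behaved small bot can be
throttled because of a heavy one, while a bot with its own identifying
user agent gets a bucket to itself.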

At the moment, if your bot is affected by our restrictions, configure a
custom user agent that identifies you; this should be sufficient to give
you enough bandwidth. If you are still running into issues, please
contact us; we'll find a solution together.
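
As a rough illustration of setting an identifying user agent, here is a
minimal Python sketch using only the standard library. The endpoint URL,
the bot name, and the contact details are assumptions made for the
example and are not stated in this message; adapt them to your own bot.

```python
import urllib.parse
import urllib.request

# Assumption: the public SPARQL endpoint URL (not given in this message).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_wdqs_request(query, bot_name, bot_version, contact):
    """Build (but do not send) a GET request with an identifying User-Agent."""
    # A descriptive agent: tool name, version, and a way to reach the author.
    user_agent = f"{bot_name}/{bot_version} ({contact})"
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "User-Agent": user_agent,
        "Accept": "application/sparql-results+json",
    }
    return urllib.request.Request(url, headers=headers)


# Hypothetical bot identity, for illustration only.
request = build_wdqs_request(
    "SELECT * WHERE { ?s ?p ?o } LIMIT 1",
    bot_name="ExampleBot",
    bot_version="0.1",
    contact="https://example.org/bot; bot@example.org",
)
# urllib.request.urlopen(request) would send it; it is left out here so
# that the sketch does not hit the service.
```

The point is simply that the user agent names the tool and gives a way
to contact its author, rather than the default string of an HTTP library.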

What's coming next:

First, it is unlikely that we will be able to remove the current
restrictions in the short term. We're sorry for that, but the
alternative (the service being unresponsive or severely lagged for
everyone) is worse.

We are exploring a number of alternatives. One is adding authentication
to the service and allowing higher quotas to bots that authenticate.
Another is creating an asynchronous queue, which could allow running
more expensive queries, but with longer deadlines. We are also in the
process of hiring another engineer to work on these ideas.

Thanks for your patience!

WDQS Team

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


--
Amir Sarabadani (he/him)
Software engineer

Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0
https://wikimedia.de

Our vision is a world in which all people can share in, use, and add to the knowledge of humanity. Help us make it happen!
https://spenden.wikimedia.de

Wikimedia Deutschland — Gesellschaft zur Förderung Freien Wissens e. V. Registered in the register of associations of the Berlin-Charlottenburg district court under number 23855 B. Recognized as a non-profit organization by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.