Hi all,
We experienced WDQS service disruptions on 2020/07/23. As a result, there was a full outage (inability to respond to any queries) lasting several minutes, and a longer period of intermittently degraded service (inability to respond to a subset of queries) lasting 1-2 hours.
The full incident report is available here: https://wikitech.wikimedia.org/wiki/Incident_documentation/20200723-wdqs-out...
Ultimately, we traced the proximate cause to a series of non-performant queries, which caused a deadlock in Blazegraph, the backend for WDQS. We have placed a temporary block on the IP address that issued those queries and are taking steps to better define service availability expectations, as well as processes to streamline detection of these events going forward.