Hello all!
As you might have seen / endured, we had a partial outage of the Wikidata Query Service (WDQS) yesterday morning (Central European Time). The full incident report is available [1] if you are interested in the details. The short version:
* a single client started to run an unusually high number of queries on WDQS
* the overload was not prevented by our current throttling
* the failure was not detected and isolated automatically
To prevent this from happening again, we will review our throttling rules. Those rules were previously tuned to prevent a single client from overloading the service with a small number of expensive requests: we only started to log a client's activity once the duration of one of its requests exceeded 10 seconds. This means that a client sending tons of short requests would never be throttled.
We will correct that by lowering the threshold, probably to 25ms. The throttling limits themselves stay the same:
* 60 seconds of processing time per minute (peaking at 120 seconds)
* 30 errors per minute (peaking at 60)
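One way to picture these numbers is as a token bucket measured in seconds of processing time. The sketch below is only an illustration under that assumption; it is not the actual WDQS implementation, and the names `TimeBudget`, `tracking_threshold` and `simulate` are hypothetical. It also assumes that a client is only tracked once one of its requests runs longer than the threshold, which is how the old 10-second setting let many short requests through.

```python
from dataclasses import dataclass


@dataclass
class TimeBudget:
    """Hypothetical per-client budget, in seconds of processing time."""
    tracking_threshold: float          # request duration that starts tracking
    refill_per_minute: float = 60.0    # sustained limit from the announcement
    capacity: float = 120.0            # peak limit from the announcement
    available: float = 120.0           # assumed to start full
    last_update: float = 0.0           # time of the last refill, in seconds
    tracked: bool = False              # becomes True after one slow request

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.last_update)
        gained = elapsed * self.refill_per_minute / 60.0
        self.available = min(self.capacity, self.available + gained)
        self.last_update = now

    def is_throttled(self, now: float) -> bool:
        # An untracked client is never throttled in this model.
        if not self.tracked:
            return False
        self._refill(now)
        return self.available <= 0.0

    def record(self, now: float, processing_seconds: float) -> None:
        if processing_seconds > self.tracking_threshold:
            self.tracked = True
        if self.tracked:
            self._refill(now)
            self.available -= processing_seconds


def simulate(requests_per_second: int, request_seconds: float,
             duration: int, threshold: float) -> int:
    """Count the requests that would be rejected over `duration` seconds."""
    budget = TimeBudget(tracking_threshold=threshold)
    rejected = 0
    for second in range(duration):
        for _ in range(requests_per_second):
            if budget.is_throttled(float(second)):
                rejected += 1
            else:
                budget.record(float(second), request_seconds)
    return rejected
```

In this model, a client sending 40 requests of 50ms each per second is never rejected with a 10-second threshold (`simulate(40, 0.05, 600, threshold=10.0)` returns 0), but does get rejected once the threshold drops to 25ms, because it uses about two seconds of processing time per second against a refill of one.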
If you are using WDQS to make lots of small requests, and you are over the throttling rates above, there is a chance that you will start seeing throttling errors. We are not doing this to bother you; we're just trying to keep another crash from happening...
If you are throttled, you will receive an HTTP 429 error code [2]. This response includes the "Retry-After" HTTP header, which specifies the number of seconds you should wait before retrying.
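For client authors, honoring that header could look roughly like the sketch below. It is a minimal example using only the Python standard library, not an official client: the endpoint URL, the User-Agent string and the helper names `retry_delay` and `run_query` are assumptions made for illustration, as is the fallback delay used when the header is missing or is not a plain number of seconds.

```python
import time
import urllib.error
import urllib.parse
import urllib.request

# Assumption: the public WDQS SPARQL endpoint; the announcement names no URL.
ENDPOINT = "https://query.wikidata.org/sparql"


def retry_delay(retry_after, default=5.0):
    """Turn a Retry-After header value into a number of seconds to sleep."""
    try:
        return max(0.0, float(retry_after))
    except (TypeError, ValueError):
        # Header missing or not a plain number of seconds: use the fallback.
        return default


def run_query(sparql, max_attempts=5):
    """Run a SPARQL query, waiting as instructed whenever the server says 429."""
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        # Hypothetical User-Agent; replace with one identifying your tool.
        headers={"User-Agent": "example-client/0.1 (hypothetical)"},
    )
    for _ in range(max_attempts):
        try:
            with urllib.request.urlopen(request) as response:
                return response.read().decode("utf-8")
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise  # only throttling responses are retried here
            time.sleep(retry_delay(err.headers.get("Retry-After")))
    raise RuntimeError("still throttled after %d attempts" % max_attempts)
```

Capping the number of attempts keeps a throttled client from retrying forever, and errors other than 429 are raised unchanged.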
Thanks for your patience!
And contact me if you want any clarification.
Guillaume
[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20171018-wdqs
[2] https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#429
Thanks Guillaume,
Is this the same incident that caused an hour's delay of Wikidata items on Wikipedia watchlists?
Cheers Yaroslav
In reply to Guillaume Lederrey <glederrey@wikimedia.org>, Thu, Oct 19, 2017 at 10:14 AM.
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hi Yaroslav,
No, but there have been some dispatch issues in the last few days. The current lag for enwiki is 3 hours, for example. You can see a graph of the dispatch lag here: https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&orgId=1&from=now-7d&to=now
Greetings,
Sjoerd de Bruin sjoerddebruin@me.com
In reply to Yaroslav Blanter <ymbalt@gmail.com>, 19 Oct 2017 at 11:30.
Thanks Sjoerd. Some en-wiki users consider the delay as a (one more) argument that Wikidata is junk and should be thrown down the toilet, so I was curious whether the delay was handled as a part of the problem.
Cheers Yaroslav
In reply to Sjoerd de Bruin <sjoerddebruin@me.com>, Thu, Oct 19, 2017 at 12:09 PM.
Hello!
As far as I understand, the dispatch lag is an issue between Wikidata and the different Wikipedias. There is no involvement of Wikidata Query Service in this. Sjoerd probably understands that much better than I do...
Note that this issue also caused some replication lag on one of the Wikidata Query Service servers [1]. In that case, this was mitigated by taking that specific server out of rotation and waiting for it to recover before sending traffic to it again. Also note that the Wikidata Query Service replication lag is a very different kind of lag from the dispatch lag you were talking about (yes, all this is complicated).
Thanks for your interest!
[1] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m&orgId=1&from=now-7d&to=now
In reply to Yaroslav Blanter <ymbalt@gmail.com>, Thu, Oct 19, 2017 at 2:29 PM.
Thanks, Guillaume, for the clarification.
Cheers Yaroslav
In reply to Guillaume Lederrey <glederrey@wikimedia.org>, Thu, Oct 19, 2017 at 3:06 PM.
Hello all!
Following up on the previous communication, the change to our throttling policy was deployed yesterday (2017-10-23 17:00 UTC). Reviewing the logs so far, I don't see any change of pattern in the number of throttled requests. This means that mostly no one should be affected, or at least not affected more than you already were.
Feel free to reach out to me if that's not the case.
Have fun!
Guillaume
-- Guillaume Lederrey Operations Engineer, Discovery Wikimedia Foundation UTC+2 / CEST